Commit Graph

1099 Commits

Author SHA1 Message Date
Cheng Pan
5b3f43dffc
[CELEBORN-763] Add --add-opens to bootstrap shell scripts
### What changes were proposed in this pull request?

Add --add-opens to bootstrap shell scripts

### Why are the changes needed?

Additional `--add-opens` options are required for Java 17. Note that the `--add-opens` list is copied from Spark and was used for UT; I am not sure each entry is required, but at least the UT passed with them.

Details supplied by cfmcgrady

[JEP 403](https://openjdk.java.net/jeps/403), targeted for [JDK 17](https://openjdk.java.net/projects/jdk/17/), removes the `--illegal-access` flag, which makes the behavior equivalent to `--illegal-access=deny`.

This means that using reflection to invoke protected methods of exported `java.*` APIs will no longer work. For example:

```shell
> /Library/Java/JavaVirtualMachines/temurin-17.jdk/Contents/Home/bin/jshell
|  Welcome to JShell -- Version 17.0.7
|  For an introduction type: /help intro

jshell> java.nio.ByteBuffer direct = java.nio.ByteBuffer.allocateDirect(1);
direct ==> java.nio.DirectByteBuffer[pos=0 lim=1 cap=1]

jshell> direct.getClass().getDeclaredConstructor(long.class, int.class).setAccessible(true);
|  Exception java.lang.reflect.InaccessibleObjectException: Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not "opens java.nio" to unnamed module 34c45dca
|        at AccessibleObject.checkCanSetAccessible (AccessibleObject.java:354)
|        at AccessibleObject.checkCanSetAccessible (AccessibleObject.java:297)
|        at Constructor.checkCanSetAccessible (Constructor.java:188)
|        at Constructor.setAccessible (Constructor.java:181)
|        at (#2:1)

jshell>

```

```shell
>  /Library/Java/JavaVirtualMachines/temurin-17.jdk/Contents/Home/bin/jshell -R --add-opens=java.base/java.nio=ALL-UNNAMED
|  Welcome to JShell -- Version 17.0.7
|  For an introduction type: /help intro

jshell> java.nio.ByteBuffer direct = java.nio.ByteBuffer.allocateDirect(1);
direct ==> java.nio.DirectByteBuffer[pos=0 lim=1 cap=1]

jshell> direct.getClass().getDeclaredConstructor(long.class, int.class).setAccessible(true);

jshell>
```
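The same check can be sketched in plain Java. `AddOpensProbe` is a hypothetical helper, not part of Celeborn, and the result depends on the JDK version and on whether `--add-opens=java.base/java.nio=ALL-UNNAMED` was passed at startup.

```java
import java.lang.reflect.Constructor;
import java.nio.ByteBuffer;

/** Hypothetical helper (not part of Celeborn): probes whether java.nio is open for reflection. */
class AddOpensProbe {
    /**
     * Returns "accessible" if the private DirectByteBuffer(long, int) constructor can be made
     * accessible, "blocked" if the runtime refuses, and "missing" if this JDK has no such constructor.
     */
    static String probeDirectByteBufferConstructor() {
        Class<?> cls = ByteBuffer.allocateDirect(1).getClass();
        try {
            Constructor<?> ctor = cls.getDeclaredConstructor(long.class, int.class);
            ctor.setAccessible(true);
            return "accessible";
        } catch (NoSuchMethodException e) {
            // The constructor signature is a JDK internal and may differ between releases.
            return "missing";
        } catch (RuntimeException e) {
            // On Java 9+ this is java.lang.reflect.InaccessibleObjectException, a RuntimeException;
            // catching the supertype keeps the sketch compilable on Java 8.
            return "blocked";
        }
    }

    public static void main(String[] args) {
        System.out.println(probeDirectByteBufferConstructor());
    }
}
```

Running it with and without the `--add-opens` option mirrors the two jshell sessions above.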

### Does this PR introduce _any_ user-facing change?

Yes, for Java 17 support.

### How was this patch tested?

CI and review

Closes #1677 from pan3793/CELEBORN-763.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-05 11:31:21 +08:00
Cheng Pan
de0fd8cc44 [CELEBORN-762] Always set JVM opts -XX:+IgnoreUnrecognizedVMOptions
### What changes were proposed in this pull request?

Always set JVM opts `-XX:+IgnoreUnrecognizedVMOptions`

### Why are the changes needed?

By default, the JVM fails to start when unknown options are set, which is unfriendly to users who want to run on different JDK versions.

### Does this PR introduce _any_ user-facing change?

Yes, users can successfully start Celeborn even if they provide unknown JVM opts.

### How was this patch tested?

Review.
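A minimal sketch of the behavior this option changes, assuming a HotSpot-based JDK. `UnknownJvmOptionDemo` and the flag `-XX:+HypotheticalFlagForDemo` are made up for illustration; the sketch launches a child JVM from the current `java.home` and reports its exit code.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical demo (not part of Celeborn): starts `java <opts> -version` and returns the exit code. */
class UnknownJvmOptionDemo {
    static int launch(String... jvmOpts) {
        String javaBin = System.getProperty("java.home")
            + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.addAll(Arrays.asList(jvmOpts));
        cmd.add("-version");
        try {
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            // Drain the merged stdout/stderr so the child process cannot block on a full pipe.
            try (InputStream in = p.getInputStream()) {
                byte[] buf = new byte[4096];
                while (in.read(buf) != -1) {
                    // discard output
                }
            }
            return p.waitFor();
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException("could not launch child JVM", e);
        }
    }

    public static void main(String[] args) {
        // The flag name below is deliberately bogus.
        System.out.println(launch("-XX:+HypotheticalFlagForDemo"));
        System.out.println(launch("-XX:+IgnoreUnrecognizedVMOptions", "-XX:+HypotheticalFlagForDemo"));
    }
}
```

The first launch is expected to fail with a non-zero exit code, while the second should start normally because the unrecognized flag is ignored.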

Closes #1676 from pan3793/CELEBORN-762.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-04 21:37:19 +08:00
Angerszhuuuu
693172d0bd [CELEBORN-751] Rename remain rss related class name and filenames etc
### What changes were proposed in this pull request?
Rename the remaining RSS-related class names, file names, etc.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1664 from AngersZhuuuu/CELEBORN-751.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-04 10:20:08 +08:00
Cheng Pan
26aaba14d4 [CELEBORN-637][FOLLOWUP] Mention configurations change in migration guide
### What changes were proposed in this pull request?

as title

### Why are the changes needed?

mention configuration behavior change in migration guide

### Does this PR introduce _any_ user-facing change?

Yes, the migration guide is updated

### How was this patch tested?

review

Closes #1673 from pan3793/CELEBORN-637-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-03 14:26:43 +08:00
Fu Chen
3964861fd7
[CELEBORN-726][FOLLOWUP] Eliminate 'TODO' comments within the Controller class
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

eliminate comments introduced in https://github.com/apache/incubator-celeborn/pull/1650

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1672 from cfmcgrady/primary-replica-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-03 13:40:30 +08:00
Cheng Pan
a1be02b4fa [CELEBORN-757] Improve metrics method signature and code style
### What changes were proposed in this pull request?

- gauge method definition improvement. i.e.

  before
  ```
  def addGauge[T](name: String, f: Unit => T, labels: Map[String, String])
  ```
  after
  ```
  def addGauge[T](name: String, labels: Map[String, String])(f: () => T)
  ```
  which improves the caller-side code style
  ```
  addGauge(name, labels) { () =>
    ...
  }
  ```

- remove unnecessary Java/Scala collection conversion. Since Scala 2.11 does not support SAM, the extra implicit function is required.

- leverage Logging to defer message evaluation

- UPPER_CASE string constants

### Why are the changes needed?

Improve code quality and (maybe) performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1670 from pan3793/CELEBORN-757.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-03 11:56:43 +08:00
Angerszhuuuu
7880c52fff [CELEBORN-745] Match TransportMessage type use number instead of enum
### What changes were proposed in this pull request?
Match the TransportMessage type by number instead of by enum so that MessageType names can be changed. After this PR, we can rename MessageType values.
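The idea can be sketched as follows. The enum constants, numbers, and the `dispatch` method are hypothetical and only illustrate the pattern; they are not Celeborn's actual `MessageType` definition.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: the numeric id, not the constant name, is treated as the wire contract. */
class MessageTypeByNumberDemo {
    enum MessageType {
        REGISTER_WORKER(1),
        HEARTBEAT_FROM_WORKER(2);

        private static final Map<Integer, MessageType> BY_NUMBER = new HashMap<>();

        static {
            for (MessageType t : values()) {
                BY_NUMBER.put(t.number, t);
            }
        }

        final int number;

        MessageType(int number) {
            this.number = number;
        }

        static MessageType fromNumber(int number) {
            MessageType t = BY_NUMBER.get(number);
            if (t == null) {
                throw new IllegalArgumentException("Unknown message type number: " + number);
            }
            return t;
        }
    }

    /** Dispatches on the number, so renaming an enum constant does not affect matching. */
    static String dispatch(int typeNumber) {
        switch (typeNumber) {
            case 1:
                return "register worker";
            case 2:
                return "worker heartbeat";
            default:
                return "unsupported";
        }
    }
}
```

Because matching only uses the number, a constant such as `HEARTBEAT_FROM_WORKER` could later be renamed without changing what the receiver sees.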

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1658 from AngersZhuuuu/CELEBORN-745.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-01 18:50:02 +08:00
zky.zhoukeyong
af0f5e5784 [CELEBORN-755][FOLLOWUP] Avoid unnecessary memory copy when compression disabled

### What changes were proposed in this pull request?
Avoid memory copy for code path where compression is disabled. Followup of https://github.com/apache/incubator-celeborn/pull/1669

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA

Closes #1671 from waitinfuture/755.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-01 18:27:33 +08:00
xiyu.zk
381165d4e7
[CELEBORN-755] Support disable shuffle compression
### What changes were proposed in this pull request?
Support to decide whether to compress shuffle data through configuration.

### Why are the changes needed?
Currently, Celeborn compresses all shuffle data, but for example, the shuffle data of Gluten has already been compressed. In this case, no additional compression is required. Therefore, configuration needs to be provided for users to decide whether to use Celeborn’s compression according to the actual situation.
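A minimal sketch of such a switch, under stated assumptions: `OptionalCompressionDemo` and its `compressionEnabled` parameter are hypothetical, and `java.util.zip.Deflater` stands in for whatever codec Celeborn actually uses.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

/** Hypothetical sketch: compress only when enabled, otherwise pass the data through untouched. */
class OptionalCompressionDemo {
    static byte[] encode(byte[] data, boolean compressionEnabled) {
        if (!compressionEnabled) {
            // Already-compressed input (for example from an upstream engine) is returned as-is,
            // without allocating or copying.
            return data;
        }
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

Returning the input array unchanged on the disabled path also avoids an extra memory copy.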

### Does this PR introduce _any_ user-facing change?
no.

Closes #1669 from kerwin-zk/celeborn-755.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-01 00:03:50 +08:00
Fu Chen
047e90b17b
[CELEBORN-756][WORKER] Refactor PushDataHandler class to utilize while loop
### What changes were proposed in this pull request?

as title

### Why are the changes needed?

per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex, use `while` loop for performance-sensitive code

worker's flame graph before:

![Screenshot 2023-06-30 5.58.02 PM](https://github.com/apache/incubator-celeborn/assets/8537877/28c199b6-a29b-4501-8064-e0f2ddb2a8b9)

after:

![Screenshot 2023-06-30 5.58.18 PM](https://github.com/apache/incubator-celeborn/assets/8537877/c6134959-5f78-436b-aa29-a78882b09e84)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1668 from cfmcgrady/while-loop-2.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-30 18:13:53 +08:00
Angerszhuuuu
5c7ecb8302
[CELEBORN-754][IMPORTANT] Provide a new SparkShuffleManager to replace RssShuffleManager in the future
### What changes were proposed in this pull request?
Provide a new SparkShuffleManager to replace RssShuffleManager in the future

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1667 from AngersZhuuuu/CELEBORN-754.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-30 17:27:33 +08:00
Angerszhuuuu
6e35745736
[CELEBORN-753] Rename spark patch file name to make it more clear
### What changes were proposed in this pull request?
Rename spark patch file name to make it more clear

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1666 from AngersZhuuuu/CELEBORN-753.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-30 11:41:12 +08:00
mingji
742815f285
[CELEBORN-749] Update grafana dashboard to remove "RSS"
### What changes were proposed in this pull request?
Update Grafana dashboard and its setup demo to remove the old name "RSS"

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
No test needed.

Closes #1663 from FMX/CELEBORN-749.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 20:44:09 +08:00
Cheng Pan
df3fc194fb
[CELEBORN-744] Add Benchmark framework and ComputeIfAbsentBenchmark
### What changes were proposed in this pull request?

The benchmark shows that `computeIfAbsent` still performs better in the simple case.

```
================================================================================================
HashMap
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Mac OS X 13.4.1
Apple M1 Pro
HashMap:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
putIfAbsent                                         701            702           0         95.7          10.4       1.0X
computeIfAbsent                                     534            535           1        125.6           8.0       1.3X

================================================================================================
ConcurrentHashMap
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Mac OS X 13.4.1
Apple M1 Pro
ConcurrentHashMap:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
putIfAbsent                                         712            716           3         94.2          10.6       1.0X
computeIfAbsent                                     702            705           2         95.6          10.5       1.0X
```

### Why are the changes needed?

Introduce a Benchmark framework for future performance sensitive case measurement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1657 from pan3793/CELEBORN-744.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 20:19:30 +08:00
Fu Chen
baa0d0b3b4
[CELEBORN-750] Simplify get Netty used direct memory logic
### What changes were proposed in this pull request?

Netty has exposed the public API `PlatformDependent.usedDirectMemory()` for getting Netty's used direct memory since [netty-4.1.35.Final](https://github.com/netty/netty/pull/8945). Use it to simplify the logic.

### Why are the changes needed?

Simplify the logic for getting Netty's used direct memory.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1662 from cfmcgrady/netty-used-memory.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-29 18:24:16 +08:00
Fu Chen
adbd38a926
[CELEBORN-726][FOLLOWUP] Update data replication terminology from master/slave to primary/replica in the codebase
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #1639 from cfmcgrady/primary-replica.

Lead-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 17:07:26 +08:00
Angerszhuuuu
1fd8833756
[CELEBORN-748] Rename RssHARetryClient to MasterClient
### What changes were proposed in this pull request?

Rename RssHARetryClient to MasterClient

### Why are the changes needed?

Code refactor

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1661 from AngersZhuuuu/CELEBORN-748.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 16:47:15 +08:00
Cheng Pan
11569689be
[CELEBORN-747] Print time/bytes in human-readable format
### What changes were proposed in this pull request?

Print time/bytes in human-readable format

### Why are the changes needed?

Make logs readable
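As an illustration of what a human-readable format can look like, here is a small sketch. The helper names `bytesToString` and `msToString`, the unit choices, and the rounding are assumptions for this example, not necessarily what the patch implements.

```java
import java.util.Locale;

/** Hypothetical formatting helpers for log messages. */
class HumanReadableDemo {
    /** Formats a non-negative byte count using binary units. */
    static String bytesToString(long size) {
        final String[] units = {"B", "KiB", "MiB", "GiB", "TiB"};
        double value = size;
        int unit = 0;
        while (value >= 1024 && unit < units.length - 1) {
            value /= 1024;
            unit++;
        }
        return unit == 0
            ? size + " B"
            : String.format(Locale.ROOT, "%.1f %s", value, units[unit]);
    }

    /** Formats a non-negative duration given in milliseconds. */
    static String msToString(long ms) {
        if (ms < 1000) {
            return ms + " ms";
        }
        if (ms < 60_000) {
            return String.format(Locale.ROOT, "%.1f s", ms / 1000.0);
        }
        return String.format(Locale.ROOT, "%.1f min", ms / 60_000.0);
    }
}
```

For example, `bytesToString(1536)` yields `1.5 KiB` and `msToString(90000)` yields `1.5 min`.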

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1659 from pan3793/minor.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 16:46:12 +08:00
cchung100m
4e1e97b2c2
[CELEBORN-478] Replace putIfAbsent with computeIfAbsent in ConcurrentHashMap in Java code
### What changes were proposed in this pull request?

Replace `putIfAbsent` with computeIfAbsent in ConcurrentHashMap

### Why are the changes needed?

With `putIfAbsent`, the value argument is always evaluated before the call, even when the key is already present, which is wasteful if constructing the value is time-consuming. So we'd better replace `putIfAbsent` with `computeIfAbsent` in some critical paths.
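The difference can be seen by counting how often the value is constructed. `ComputeIfAbsentDemo` and `expensiveValue` are hypothetical names used only for this sketch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo: counts value constructions for putIfAbsent vs computeIfAbsent. */
class ComputeIfAbsentDemo {
    static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();

    /** Stand-in for a time-consuming value construction. */
    static String expensiveValue(String key) {
        CONSTRUCTIONS.incrementAndGet();
        return "value-for-" + key;
    }

    static int constructionsWithPutIfAbsent(int calls) {
        CONSTRUCTIONS.set(0);
        ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
        for (int i = 0; i < calls; i++) {
            // The argument is evaluated on every call, whether or not the key exists.
            map.putIfAbsent("k", expensiveValue("k"));
        }
        return CONSTRUCTIONS.get();
    }

    static int constructionsWithComputeIfAbsent(int calls) {
        CONSTRUCTIONS.set(0);
        ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
        for (int i = 0; i < calls; i++) {
            // The mapping function runs only when the key is absent.
            map.computeIfAbsent("k", ComputeIfAbsentDemo::expensiveValue);
        }
        return CONSTRUCTIONS.get();
    }
}
```

With three calls on the same key, the `putIfAbsent` variant constructs the value three times, while the `computeIfAbsent` variant constructs it once.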

### Does this PR introduce _any_ user-facing change?

No, it does not affect the user-facing API

### How was this patch tested?

current UT

Closes #1567 from cchung100m/CELEBORN-478.

Lead-authored-by: cchung100m <cchung100m@cs.ccu.edu.tw>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Neo Chien <cchung100m@cs.ccu.edu.tw>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 16:44:18 +08:00
Angerszhuuuu
bd7c2ea35a [CELEBORN-746][BUILD] Rename project files from rss-xx to celeborn-xx
### What changes were proposed in this pull request?
Rename project files from rss-xx to celeborn-xx

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1660 from AngersZhuuuu/CELEBORN-746.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-29 16:30:02 +08:00
Cheng Pan
fd0cf11eca
[CELEBORN-738] Enable CI for Java 17
### What changes were proposed in this pull request?

Enable CI for Celeborn Master/Worker and Client with Spark 3.3/3.4

### Why are the changes needed?

Ensure Celeborn works on Java 17.

Note: there may be some code paths that are not covered by tests, we should fix them in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA

Closes #1649 from pan3793/CELEBORN-738.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 13:47:55 +08:00
Angerszhuuuu
1a53db22ce
[CELEBORN-739] Rename HeartbeatResponse to HeartbeatFromWorkerResponse
### What changes were proposed in this pull request?
Rename HeartbeatResponse to HeartbeatFromWorkerResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1651 from AngersZhuuuu/CELEBORN-739.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 13:08:08 +08:00
Cheng Pan
b308ac6717
[CELEBORN-742][BUILD] Bump Hadoop 3.2.4
### What changes were proposed in this pull request?

Bump Hadoop from 3.2.1 to 3.2.4.

### Why are the changes needed?

Always use the latest patched version.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1654 from pan3793/CELEBORN-742.

Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-29 11:48:45 +08:00
Cheng Pan
78327ebd4a [CELEBORN-743][BUILD] Bump commons-io to 2.13.0
### What changes were proposed in this pull request?

Bump commons-io to 2.13.0

### Why are the changes needed?

- https://commons.apache.org/proper/commons-io/changes-report.html#a2.9.0
- https://commons.apache.org/proper/commons-io/changes-report.html#a2.10.0
- https://commons.apache.org/proper/commons-io/changes-report.html#a2.11.0
- https://commons.apache.org/proper/commons-io/changes-report.html#a2.12.0
- https://commons.apache.org/proper/commons-io/changes-report.html#a2.13.0

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1655 from pan3793/CELEBORN-743.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-29 10:26:57 +08:00
Cheng Pan
3c6d90b5e5 [CELEBORN-741][BUILD] Bump Spark to latest patched version
### What changes were proposed in this pull request?

Bump Spark

- from 3.2.2 to 3.2.4
- from 3.3.1 to 3.3.2
- from 3.4.0 to 3.4.1

### Why are the changes needed?

Keep Spark versions up to date.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1653 from pan3793/CELEBORN-741.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-29 10:23:40 +08:00
Cheng Pan
c33aabfa37 [CELEBORN-740] Remove usage of AccessController
### What changes were proposed in this pull request?

Remove usage of deprecated `java.security.AccessController`

### Why are the changes needed?

`AccessController` is deprecated for removal since Java 17

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/security/AccessController.html

Recover building for Java 17

```
[INFO] compiling 72 Scala sources and 209 Java sources to /home/runner/work/incubator-celeborn/incubator-celeborn/common/target/classes ...
Error:  /home/runner/work/incubator-celeborn/incubator-celeborn/common/src/main/scala/org/apache/celeborn/common/serializer/SerializationDebugger.scala:71: class AccessController in package security is deprecated
Error: [ERROR] one error found
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

```
scala> System.getProperty("java.version")
res0: String = 1.8.0_332

scala> System.getProperty("sun.io.serialization.extendedDebugInfo")
res1: String = null

scala> java.lang.Boolean.getBoolean("sun.io.serialization.extendedDebugInfo")
res2: Boolean = false
```
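The REPL session above suggests the flag can be read with a plain system-property lookup. A minimal Java sketch, assuming that `Boolean.getBoolean` is used in place of the `AccessController`-based read (`ExtendedDebugInfoDemo` is a hypothetical class name):

```java
/** Hypothetical sketch: reads the serialization debug flag without AccessController. */
class ExtendedDebugInfoDemo {
    static final String PROP = "sun.io.serialization.extendedDebugInfo";

    /** Returns true only if the system property exists and equals "true" (case-insensitive). */
    static boolean extendedDebugInfoEnabled() {
        return Boolean.getBoolean(PROP);
    }

    public static void main(String[] args) {
        System.out.println(extendedDebugInfoEnabled());
    }
}
```

When the property is unset, as in the session above, the method returns `false`.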

Closes #1652 from pan3793/CELEBORN-740.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-29 10:21:02 +08:00
Fu Chen
17c1e01874
[CELEBORN-726] Update data replication terminology from master/slave to primary/replica for configurations and metrics
### What changes were proposed in this pull request?

This PR is an integral part of #1639. It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC.

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests.

Closes #1650 from cfmcgrady/primary-replica-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 09:47:02 +08:00
Cheng Pan
c2352a2f9f [CELEBORN-736][BUILD] Bump commons-lang3 3.12.0
### What changes were proposed in this pull request?

Bump commons-lang3 to latest version

### Why are the changes needed?

- https://commons.apache.org/proper/commons-lang/changes-report.html#a3.11
- https://commons.apache.org/proper/commons-lang/changes-report.html#a3.12.0

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1648 from pan3793/CELEBORN-736.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 21:15:44 +08:00
Angerszhuuuu
4c4e18b0d6 [CELEBORN-735] Remove unused RPC GetWorkerInfo & GetWorkerInfosResponse
### What changes were proposed in this pull request?
Remove unused RPC GetWorkerInfo & GetWorkerInfosResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1647 from AngersZhuuuu/CELEBORN-735.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-28 20:17:56 +08:00
Angerszhuuuu
a672db719a [CELEBORN-734] Remove unused RPC ReregisterWorkerResonse
### What changes were proposed in this pull request?
Remove unused RPC ReregisterWorkerResonse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1646 from AngersZhuuuu/CELEBORN-734.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-28 19:59:53 +08:00
Angerszhuuuu
590198ecea [CELEBORN-666][FOLLOWUP] Rename all RPC blacklist fields
### What changes were proposed in this pull request?
In this PR, we rename all RPC blacklist fields. This does not cause compatibility issues.

The RPCs `GetBlacklist` and `GetBlacklistResponse` are left unchanged, since they won't be used in the next release and can be removed then.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1643 from AngersZhuuuu/CELEBORN-666-RPC.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 19:49:44 +08:00
Angerszhuuuu
ad13b04f2e [CELEBORN-732] Remove unused RPC ThreadDump & ThreadDumpResponse
### What changes were proposed in this pull request?
Remove unused RPC ThreadDump & ThreadDumpResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1645 from AngersZhuuuu/CELEBORN-732.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 19:43:39 +08:00
Angerszhuuuu
63f22342e9
[CELEBORN-730] Remove unused SlaveLostResponse
### What changes were proposed in this pull request?
Remove unused SlaveLostResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1644 from AngersZhuuuu/CELEBORN-730.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 19:35:23 +08:00
onebox-li
1b74d85fb1 [CELEBORN-725][MINOR] Refine congestion code
### What changes were proposed in this pull request?
Refine the congestion relevant code/log/comments

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manually test

Closes #1637 from onebox-li/improve-congestion.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 18:31:40 +08:00
Cheng Pan
3d7c1fa0ae [CELEBORN-729] Fix typo PbRegisterShuffle#numMappers
### What changes were proposed in this pull request?

Fix typo `numMapppers`, should be `numMappers`

### Why are the changes needed?

Fix typo

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Protobuf serde depends on message field seq no, not name.
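To see why a rename is safe, here is a hand-rolled sketch of how a single varint field is laid out on the wire. `ProtoFieldNumberDemo` is a hypothetical class that follows the documented Protobuf encoding rules rather than using the Protobuf library.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of Protobuf's wire layout for one varint field. */
class ProtoFieldNumberDemo {
    /** Encodes tag = (fieldNumber << 3) | wireType 0, followed by the value as a varint. */
    static byte[] encodeVarintField(int fieldNumber, long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, (long) fieldNumber << 3); // wire type 0 (varint) occupies the low 3 bits
        writeVarint(out, value);
        return out.toByteArray();
    }

    private static void writeVarint(ByteArrayOutputStream out, long v) {
        // Emit 7 bits at a time, least significant group first; the high bit marks continuation.
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }
}
```

Encoding field number 1 with value 150 produces the bytes `08 96 01`. The field name never appears in the output, so correcting a name such as `numMapppers` leaves the serialized form unchanged.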

Closes #1642 from pan3793/CELEBORN-729.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 18:28:34 +08:00
Cheng Pan
b821349c4a
[CELEBORN-727][TEST] Fix flaky test RssHashCheckDiskSuite
### What changes were proposed in this pull request?

Fix the flaky test by enlarging `celeborn.client.shuffle.expired.checkInterval`

### Why are the changes needed?

```
RssHashCheckDiskSuite:
- celeborn spark integration test - hash-checkDiskFull *** FAILED ***
  868 was not less than 0 (RssHashCheckDiskSuite.scala:83)
```

https://github.com/apache/incubator-celeborn/actions/runs/5396767745/jobs/9800766633

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA; CI should be observed afterwards.

Closes #1640 from pan3793/CELEBORN-727.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 17:59:54 +08:00
Demon Liang
a1199a9895 [CELEBORN-728] Celeborn won't clean remnant application directory on HDFS if worker is restarted
### What changes were proposed in this pull request?
To clean the remnant application directory after Celeborn Worker is restarted.

### Why are the changes needed?
Remnant application directories will not be deleted, because `hadoopFs.listFiles(path,false)` will not list directories.

### Does this PR introduce _any_ user-facing change?
No.

Closes #1641 from Demon-Liang/0.3-dev.

Authored-by: Demon Liang <liangdingwen.ldw@alipay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
(cherry picked from commit 42a9160c8ceaf79bae514c54dafcb5b8e12d5251)
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 17:54:08 +08:00
Angerszhuuuu
afab4a0a3b [CELEBORN-696][FOLLOWUP] Remove new allocated peer workers from pushExecludedWrkers
### What changes were proposed in this pull request?
When removing newly allocated locations' workers from pushExecludedWrkers, their peer workers should also be removed.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1636 from AngersZhuuuu/CELEBORN-696-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 17:38:36 +08:00
Angerszhuuuu
3985a5cbd7 [CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment
### What changes were proposed in this pull request?
Unify all blacklist related code and comment

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 16:28:03 +08:00
zhongqiang.czq
374d735ae5
[CELEBORN-724] Fix the compatibility of HeartbeatFromApplicationResponse with lower versions

### What changes were proposed in this pull request?
The master side checks the `reply` field of HeartbeatFromApplication. If `reply` is true, it replies with HeartbeatFromApplicationResponse; otherwise it replies with OneWayMessageResponse.

The `reply` field defaults to false in version 0.2.1 and earlier, so the master stays compatible with older client versions.
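The branching described above can be sketched as follows. All classes here are simplified, hypothetical stand-ins for the real message types, which live in `ControlMessages` and carry more fields.

```java
/** Hypothetical sketch of the master choosing a response type based on the request's reply flag. */
class HeartbeatCompatDemo {
    static final class HeartbeatFromApplication {
        final String appId;
        // Older clients never set this flag, so it is false for them.
        final boolean reply;

        HeartbeatFromApplication(String appId, boolean reply) {
            this.appId = appId;
            this.reply = reply;
        }
    }

    interface Response {}

    static final class OneWayMessageResponse implements Response {}

    static final class HeartbeatFromApplicationResponse implements Response {}

    static Response handle(HeartbeatFromApplication heartbeat) {
        // New clients opt in to the richer response; old clients keep getting the legacy one.
        return heartbeat.reply
            ? new HeartbeatFromApplicationResponse()
            : new OneWayMessageResponse();
    }
}
```

An old client therefore never receives a response type that it cannot deserialize.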

### Why are the changes needed?
In version `0.2.1` and earlier, the response to HeartbeatFromApplication is `OneWayMessageResponse`, but from `0.3.0` on it is `HeartbeatFromApplicationResponse`.
If the client side is `0.2.1` and the server side is `0.3.0`, a compatibility issue occurs.
The following compatibility errors are printed:

``` java
java.io.InvalidObjectException: enum constant HEARTBEAT_FROM_APPLICATION_RESPONSE does not exist in class org.apache.celeborn.common.protocol.MessageType
	at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:2157) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1662) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2430) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2354) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2212) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1668) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502) ~[?:1.8.0_362]
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460) ~[?:1.8.0_362]
	at org.apache.celeborn.common.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?]
```
``` java
Caused by: java.lang.ClassCastException: Cannot cast org.apache.celeborn.common.protocol.message.ControlMessages$HeartbeatFromApplicationResponse to org.apache.celeborn.common.protocol.message.ControlMessages$OneWayMessageResponse$
	at java.lang.Class.cast(Class.java:3369) ~[?:1.8.0_362]
	at scala.concurrent.Future.$anonfun$mapTo$1(Future.scala:500) ~[scala-library-2.12.15.jar:?]
	at scala.util.Success.$anonfun$map$1(Try.scala:255) ~[scala-library-2.12.15.jar:?]
	at scala.util.Success.map(Try.scala:213) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:67) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:82) ~[scala-library-2.12.15.jar:?]
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:59) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:875) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:110) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Promise.trySuccess(Promise.scala:94) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.Promise.trySuccess$(Promise.scala:94) ~[scala-library-2.12.15.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187) ~[scala-library-2.12.15.jar:?]
	at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:218) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?]
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
The PR was tested manually as follows:
1. The server side is deployed from the latest branch-0.3 code.
2. The Spark client is deployed at version 0.2.1, then spark-sql runs 3 TPC-DS queries (query1.sql/query2.sql/query3.sql, 1T data size); finally, verify that the queries succeed and that the compatibility error above is not printed.
3. The Spark client is deployed at version 0.3.0, then spark-sql runs 3 TPC-DS queries (query1.sql/query2.sql/query3.sql, 1T data size); finally, verify that the queries succeed and that the compatibility error above is not printed.

This patch had conflicts when merged, resolved by
Committer: Cheng Pan <chengpan@apache.org>

Closes #1635 from zhongqiangczq/heartbeat2.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 16:04:18 +08:00
Angerszhuuuu
33cf343d20 [CELEBORN-666][REFACTOR] Unify exclude and blacklist related configuration
### What changes were proposed in this pull request?
Unify exclude and blacklist related configuration

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1633 from AngersZhuuuu/CELEBORN-666-NEW.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-28 10:59:58 +08:00
zky.zhoukeyong
57b0e815cf [CELEBORN-656] Batch revive RPCs in client to avoid too many requests
### What changes were proposed in this pull request?
This PR batches revive requests and periodically sends them to LifecycleManager to reduce the number of RPC requests.

In more detail: this PR changes the Revive message to support multiple unique partitions, and also passes a set of unique mapIds for checking MapEnd. Each time ShuffleClientImpl wants to revive, it adds a ReviveRequest to ReviveManager and waits for the result. ReviveManager batches revive requests and periodically sends them to LifecycleManager (deduplicated by partitionId). LifecycleManager constructs a ChangeLocationsCallContext and, after all locations are notified, replies to ShuffleClientImpl.
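
The batching idea described above can be sketched roughly as follows. This is a minimal, hypothetical illustration and not Celeborn's actual `ReviveManager`: the class `ReviveBatcherSketch`, its methods, and the `batchRpc` function standing in for the call to LifecycleManager are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch: names are illustrative, not Celeborn's real classes.
final class ReviveBatcherSketch {
    // Pending requests keyed by partitionId, so duplicate requests share one future.
    private final Map<Integer, CompletableFuture<String>> pending = new LinkedHashMap<>();

    // A writer asks for a partition to be revived and gets a future to wait on.
    synchronized CompletableFuture<String> submit(int partitionId) {
        return pending.computeIfAbsent(partitionId, id -> new CompletableFuture<>());
    }

    // Takes everything queued so far as one batch and clears the queue.
    private synchronized Map<Integer, CompletableFuture<String>> drain() {
        Map<Integer, CompletableFuture<String>> batch = new LinkedHashMap<>(pending);
        pending.clear();
        return batch;
    }

    // Meant to be called periodically: sends one batched request (batchRpc is an
    // assumed stand-in for the RPC) and completes every waiter with its new location.
    // Returns the number of distinct partitions in the batch.
    int flush(Function<List<Integer>, Map<Integer, String>> batchRpc) {
        Map<Integer, CompletableFuture<String>> batch = drain();
        if (batch.isEmpty()) {
            return 0;
        }
        Map<Integer, String> newLocations = batchRpc.apply(new ArrayList<>(batch.keySet()));
        batch.forEach((id, future) -> future.complete(newLocations.get(id)));
        return batch.size();
    }
}
```

In this sketch, many writers hitting the same failed partition collapse into a single entry, and one call replaces what would otherwise be one RPC per writer.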

### Why are the changes needed?
In my test (3T TPCDS q23a with 3 Celeborn workers), when one worker is killed, the LifecycleManager receives about 48,000 (4.8w) Revive requests:
```
[emr-usermaster-1-1 logs]$ cat spark-emr-user-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master-1-1.c-fa08904e94c028d1.out.1 |grep -i revive |wc -l
64364
```
After this PR, the number of ReviveBatch requests drops to 708:
```
[emr-usermaster-1-1 logs]$ cat spark-emr-user-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master-1-1.c-fa08904e94c028d1.out |grep -i revive |wc -l
2573
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test. I have tested:

1. Disable graceful shutdown, kill one worker, job succeeds
2. Disable graceful shutdown, kill two workers successively, job fails as expected
3. Enable graceful shutdown, restart two workers successively, job succeeds
4. Enable graceful shutdown, restart two workers successively, then kill the third one, job succeeds

Closes #1588 from waitinfuture/656-2.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-06-27 22:11:04 +08:00
Shuang
fe2f76dba6 [CELEBORN-717][FLINK][FOLLOWUP] Fix ResultPartition lost numBytesOut/numBuffersOut metrics
### What changes were proposed in this pull request?
The metrics update logic needs to align with Flink 1.17/1.15.

### Why are the changes needed?
See [1626](https://github.com/apache/incubator-celeborn/pull/1626). The metrics update logic also needs to align with Flink 1.17/1.15.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tpcds Manual

Closes #1631 from RexXiong/CELEBORN-717-FOLLOWUP.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-06-27 21:47:41 +08:00
zky.zhoukeyong
ebff17ec3c
[CELEBORN-721] Fix concurrent bug in ChangePartitionManager
### What changes were proposed in this pull request?
Fixes a concurrency bug in ChangePartitionManager.

### Why are the changes needed?
Before this PR, `ChangePartitionManager.start` tried to synchronize on `requests` in the body
of `run()`, but the `synchronized` keyword was placed outside `batchHandleChangePartitionExecutors.submit`,
so it had no effect on the submitted task.

While testing https://github.com/apache/incubator-celeborn/pull/1588, I ran into unexpected situations where
`inBatchPartitions` was still not empty even though all `rss-lifecycle-manager-change-partition-executor`
threads were idle:
```
23/06/27 20:35:55 INFO ChangePartitionManager: Inside run, shuffleId 0 inBatchPartitions size 834
```
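
The pitfall described above (a `synchronized` block wrapped around `submit` rather than placed inside the task body) can be reproduced in isolation. The following is a hedged sketch with hypothetical names (`SubmitLockSketch`, `requests`); it does not reproduce the real `ChangePartitionManager` code, only the general Java behavior: the monitor is held by the submitting thread for the duration of `submit`, not by the pool thread that later runs the task.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: names are hypothetical, not ChangePartitionManager's code.
final class SubmitLockSketch {
    private static final Set<Integer> requests = new HashSet<>();

    // Ineffective pattern: the lock only guards the submit() call itself.
    // The task reports whether the thread running it holds the lock.
    static boolean lockHeldInTaskWhenSynchronizedOutside(ExecutorService pool) {
        Callable<Boolean> task = () -> Thread.holdsLock(requests);
        Future<Boolean> result;
        synchronized (requests) {
            result = pool.submit(task);
        }
        return await(result);
    }

    // Effective pattern: the task body acquires the lock on the pool thread.
    static boolean lockHeldInTaskWhenSynchronizedInside(ExecutorService pool) {
        Callable<Boolean> task = () -> {
            synchronized (requests) {
                return Thread.holdsLock(requests);
            }
        };
        return await(pool.submit(task));
    }

    private static boolean await(Future<Boolean> future) {
        try {
            return future.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            System.out.println("outside: " + lockHeldInTaskWhenSynchronizedOutside(pool));
            System.out.println("inside:  " + lockHeldInTaskWhenSynchronizedInside(pool));
        } finally {
            pool.shutdown();
        }
    }
}
```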

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1634 from waitinfuture/721.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-27 21:30:47 +08:00
Angerszhuuuu
4c67325a3d
[CELEBORN-720][SPARK] Correct metric peakExecutionMemory of SortBasedShuffleWriter
### What changes were proposed in this pull request?
Currently SortBasedShuffleWriter does not update peakMemoryUsedBytes; this PR adds support for it.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1632 from AngersZhuuuu/CELEBORN-720.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-27 18:40:06 +08:00
mingji
40760ede3a [CELEBORN-568] Support storage type selection
### What changes were proposed in this pull request?
1. Celeborn supports storage type selection. HDD, SSD, and HDFS are available for now.
2. Add a new buffer size for HDFS file writers.
3. Workers support empty working dirs.

### Why are the changes needed?
Support the HDFS-only scenario.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1619 from FMX/CELEBORN-568.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-27 18:07:08 +08:00
Angerszhuuuu
a2b215bd47 [CELEBORN-718] Support override Hadoop Conf by Celeborn Conf with celeborn.hadoop. prefix
### What changes were proposed in this pull request?
The Hadoop configuration generated by Celeborn should respect the Celeborn conf.

### Why are the changes needed?

On the Spark client side, write the key as `spark.celeborn.hadoop.xxx.xx`.
On the server side, write the key as `celeborn.hadoop.xxx.xxx`.
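
As a rough illustration of the prefix convention above, the sketch below overlays every `celeborn.hadoop.`-prefixed entry onto a set of Hadoop settings after stripping the prefix. It uses plain maps instead of Hadoop's `Configuration` class to stay self-contained; `HadoopConfOverlaySketch` and `overlay` are hypothetical names, and the handling of the extra `spark.` prefix on the client side is assumed to happen before this step.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch using plain maps; the real code presumably works on
// org.apache.hadoop.conf.Configuration, which is not shown here.
final class HadoopConfOverlaySketch {
    static final String PREFIX = "celeborn.hadoop.";

    // Returns the Hadoop settings with every "celeborn.hadoop.<key>" entry
    // from the Celeborn conf applied as "<key>", overriding existing values.
    static Map<String, String> overlay(Map<String, String> hadoopDefaults,
                                       Map<String, String> celebornConf) {
        Map<String, String> merged = new HashMap<>(hadoopDefaults);
        for (Map.Entry<String, String> entry : celebornConf.entrySet()) {
            if (entry.getKey().startsWith(PREFIX)) {
                merged.put(entry.getKey().substring(PREFIX.length()), entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> hadoop = new HashMap<>();
        hadoop.put("dfs.replication", "3");
        Map<String, String> celeborn = new HashMap<>();
        celeborn.put("celeborn.hadoop.dfs.replication", "2");
        System.out.println(overlay(hadoop, celeborn));
    }
}
```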

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1629 from AngersZhuuuu/CELEBORN-719.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-27 17:00:47 +08:00
zky.zhoukeyong
809c76a2e4 [CELEBORN-718] Decrease RemainingReviveTimes regardless worker is excluded or not

### What changes were proposed in this pull request?
This PR makes ReviveTimes decrease regardless of whether the partition location is excluded.

### Why are the changes needed?
In the following test setup:

- 3 Celeborn workers
- Client side blacklist enabled: `spark.celeborn.client.push.blacklist.enabled=true`
- Replication is on: `spark.celeborn.client.push.replicate.enabled=true`
- Successively kill 2 workers

I expected the tasks to fail because of revive failure (when replication is on, we need at least 2 workers), but instead
the tasks hang forever. When digging into the logs I found that the `remain revive times` does not decrease, leading
to an infinite revive loop.
```
23/06/27 14:00:57 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:01 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:05 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:09 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:13 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:17 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:21 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:25 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:29 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:33 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:37 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:41 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:45 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:49 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:53 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:01:57 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:02:01 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:02:05 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:02:09 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
23/06/27 14:02:13 ERROR ShuffleClientImpl: Push data to xxx:xxx failed for shuffle 0 map 998 attempt 1 partition 666 batch 1, remain revive times 5.
```

The reason is that, before this PR, the revive times did not decrease if the partition location was excluded, and I
see no reason for that behavior.
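
The difference between the two behaviors can be sketched as a small retry loop. This is a simplified model, not the `ShuffleClientImpl` logic: `ReviveRetrySketch`, its parameters, and the `safetyCap` (added only so the demonstration terminates) are assumptions for illustration. Every push is assumed to fail, as in the test setup above.

```java
// Simplified, hypothetical model of the retry budget; not Celeborn's real code.
final class ReviveRetrySketch {
    // Counts failed push attempts until the retry budget runs out.
    // alwaysDecrement=false models the old behavior (no decrement when the
    // location is excluded); alwaysDecrement=true models the behavior after this PR.
    // safetyCap exists only so that the old behavior terminates in this demo.
    static int attemptsUntilGiveUp(int reviveTimes, boolean locationExcluded,
                                   boolean alwaysDecrement, int safetyCap) {
        int remaining = reviveTimes;
        int attempts = 0;
        while (remaining > 0 && attempts < safetyCap) {
            attempts++; // the push fails every time in this scenario
            if (alwaysDecrement || !locationExcluded) {
                remaining--;
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Old behavior with an excluded location: only the cap stops the loop.
        System.out.println(attemptsUntilGiveUp(5, true, false, 1000));
        // New behavior: the budget is consumed and the task can fail.
        System.out.println(attemptsUntilGiveUp(5, true, true, 1000));
    }
}
```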

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1628 from waitinfuture/718.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-27 15:21:09 +08:00
Shuang
22b21295e8
[CELEBORN-717][FLINK] Fix ResultPartition lost numBytesOut/numBuffersOut metrics
### What changes were proposed in this pull request?
Reset numBytesOut/numBuffersOut metrics for RemoteShuffleResultPartition

### Why are the changes needed?
Currently ResultPartition loses the numBytesOut/numBuffersOut metrics, so the Flink AdaptiveScheduler cannot dynamically adjust task parallelism based on the amount of input data.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test.

Closes #1626 from RexXiong/CELEBORN-717.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-27 11:49:00 +08:00
Cheng Pan
1753556565
[CELEBORN-713] Local network binding support IP or FQDN
### What changes were proposed in this pull request?

This PR aims to make local network address binding support both the IP and FQDN strategies.

Additionally, it refactors `ShuffleClientImpl#genAddressPair` from `${hostAndPort}-${hostAndPort}` to `Pair<String, String>`. The old form works properly with IPs but may break with FQDNs, because an FQDN may contain `-`.

### Why are the changes needed?

Currently, when the bind hostname is not set explicitly, Celeborn finds the first non-loopback address and always binds to the IP. This is not suitable for K8s, where a StatefulSet (STS) Pod has a stable FQDN but its Pod IP changes whenever the Pod restarts.

`ShuffleClientImpl#genAddressPair` must be changed as well; otherwise it may cause:

```
java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11657 in stage 0.0 failed 4 times, most recent failure: Lost task 11657.3 in stage 0.0 (TID 12747) (10.153.253.198 executor 157): java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.celeborn.client.ShuffleClientImpl.doPushMergedData(ShuffleClientImpl.java:874)
	at org.apache.celeborn.client.ShuffleClientImpl.pushOrMergeData(ShuffleClientImpl.java:735)
	at org.apache.celeborn.client.ShuffleClientImpl.mergeData(ShuffleClientImpl.java:827)
	at org.apache.spark.shuffle.celeborn.SortBasedPusher.pushData(SortBasedPusher.java:140)
	at org.apache.spark.shuffle.celeborn.SortBasedPusher.insertRecord(SortBasedPusher.java:192)
	at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.fastWrite0(SortBasedShuffleWriter.java:192)
	at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.write(SortBasedShuffleWriter.java:145)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1508)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```
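
One plausible way a `-`-joined address pair leads to this kind of exception is sketched below. The actual parsing at `ShuffleClientImpl.java:874` is not shown in this PR description, so `AddressPairSketch` and its methods are hypothetical, and `Map.Entry` is used as a standard-library stand-in for the `Pair<String, String>` mentioned above.

```java
import java.util.AbstractMap;
import java.util.Map;

// Hypothetical reconstruction for illustration; not the real ShuffleClientImpl code.
final class AddressPairSketch {
    // Old style: join two host:port strings with '-' and split them again later.
    static String portOfFirst(String primary, String replica) {
        String joined = primary + "-" + replica;
        String first = joined.split("-")[0]; // wrong piece if the host contains '-'
        return first.split(":")[1];          // throws if that piece has no ':'
    }

    // New style: keep both addresses as a typed pair, so nothing is re-parsed.
    static Map.Entry<String, String> pair(String primary, String replica) {
        return new AbstractMap.SimpleImmutableEntry<>(primary, replica);
    }

    public static void main(String[] args) {
        System.out.println(portOfFirst("10.0.0.1:9097", "10.0.0.2:9098"));
        Map.Entry<String, String> p =
            pair("celeborn-worker-4.example:9097", "celeborn-worker-5.example:9098");
        System.out.println(p.getKey() + " / " + p.getValue());
    }
}
```

With IP-based addresses the split is harmless, which would explain why the problem only surfaces once FQDNs are used for binding.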

### Does this PR introduce _any_ user-facing change?

Yes, a new configuration `celeborn.network.bind.preferIpAddress` is introduced, and the default value is `true` to preserve the existing behavior.
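
A minimal sketch of what such a switch might look like, assuming the choice boils down to `InetAddress#getHostAddress` versus `InetAddress#getCanonicalHostName`. The class `BindAddressSketch` and the method `bindHost` are hypothetical, and how Celeborn actually resolves the FQDN is not stated in this description.

```java
import java.net.InetAddress;

// Hypothetical sketch; the real resolution logic in Celeborn may differ.
final class BindAddressSketch {
    // preferIpAddress=true keeps the IP-based behavior; false asks the JDK
    // for the canonical host name (typically the FQDN when DNS provides one).
    static String bindHost(InetAddress local, boolean preferIpAddress) {
        return preferIpAddress ? local.getHostAddress() : local.getCanonicalHostName();
    }

    public static void main(String[] args) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        System.out.println("IP:   " + bindHost(loopback, true));
        System.out.println("name: " + bindHost(loopback, false));
    }
}
```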

### How was this patch tested?

Manually tested with `celeborn.network.bind.preferIpAddress=false`
```
Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-0.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.143.252

Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-1.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.173.94

Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-2.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.149.42

starting org.apache.celeborn.service.deploy.worker.Worker, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.worker.Worker-1-celeborn-worker-4.out
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.Dispatcher#51 - Dispatcher numThreads: 4
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.network.client.TransportClientFactory#91 - mode NIO threads 64
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.NettyRpcEnvFactory#51 - Starting RPC Server [WorkerSys] on celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 with advisor endpoint celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.util.Utils#51 - Successfully started service 'WorkerSys' on port 38303.
```

Closes #1622 from pan3793/CELEBORN-713.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-27 09:42:11 +08:00