Commit Graph

238 Commits

Author SHA1 Message Date
onebox-li
405b2801fa [CELEBORN-810] Fix some typos and grammar
### What changes were proposed in this pull request?
Fix some typos and grammar

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manually test

Closes #1733 from onebox-li/fix-typo.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-19 18:35:38 +08:00
Cheng Pan
0db919403e Revert "[CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…"
This reverts commit e56a8a8bed.
2023-07-19 15:08:45 +08:00
Fu Chen
16d9c657c2
[CELEBORN-805][FOLLOWUP] Remove unnecessary TODO
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

cleanup the unnecessary TODO which introduced in https://github.com/apache/incubator-celeborn/pull/1727

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Review

Closes #1728 from cfmcgrady/shutdown.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-18 13:37:21 +08:00
Fu Chen
7c6644b1a7
[CELEBORN-805] Immediate shutdown of server upon completion of unit test to prevent potential resource leakage
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Recently, while conducting the sbt build test, it came to my attention that certain resources such as ports and threads were not being released promptly.

This pull request introduces a new method, `shutdown(graceful: Boolean)`, to the `Service` trait. When invoked by `MiniClusterFeature.shutdownMiniCluster`, it calls `worker.shutdown(graceful = false)`. This implementation aims to prevent possible memory leaks during CI processes.

Before this PR the unit tests in the `client/common/master/service/worker` modules resulted in leaked ports.

```
$ jps
1138131 Jps
1130743 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1130743
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 127.0.0.1:12345         0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:41563           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:42905           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:44419           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:45025           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:44799           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39053           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39029           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:39475           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:40153           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:33051           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:33449           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:34073           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:35347           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:35971           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 0.0.0.0:36799           0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 192.168.1.151:40775     0.0.0.0:*               LISTEN      1130743/java
tcp        0      0 192.168.1.151:44457     0.0.0.0:*               LISTEN      1130743/java
```

After this PR:

```
$ jps
1114423 Jps
1107544 sbt-launch-1.9.0.jar
$ netstat -lntp | grep 1107544
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1727 from cfmcgrady/shutdown.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-18 13:12:51 +08:00
zky.zhoukeyong
e56a8a8bed [CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…
…up client

### What changes were proposed in this pull request?
Add heartbeat from client to lifecycle manager. In this PR heartbeat request contains local shuffle ids from
client, lifecycle manager checks with it's local set and returns ids it doesn't know. Upon receiving response,
client calls ```unregisterShuffle``` for cleanup.

### Why are the changes needed?
Before this PR, client side ```unregisterShuffle``` is never called. When running TPCDS 3T with spark thriftserver
without DRA, I found the Executor's heap contains 1.6 million PartitionLocation objects (and StorageInfo):
![image](https://github.com/apache/incubator-celeborn/assets/948245/43658369-7763-4511-a5b0-9b3fbdf02005)

After this PR, the number of PartitionLocation objects decreases to 275 thousands
![image](https://github.com/apache/incubator-celeborn/assets/948245/45f8f849-186d-4cad-83c8-64bd6d18debc)

This heartbeat can be extended in the future for other purposes, i.e. reporting client's metrics.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and  manual test.

Closes #1719 from waitinfuture/798.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-17 18:14:10 +08:00
zky.zhoukeyong
03717375fc [CELEBORN-790][FOLLOWUP] Use allocator.compositeDirectBuffer to track memory leak
### What changes were proposed in this pull request?
According to https://github.com/apache/incubator-celeborn/pull/1709#discussion_r1260133078

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA.

Closes #1711 from waitinfuture/790-fu.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-13 10:01:11 +08:00
zky.zhoukeyong
9e74cf3547
[CELEBORN-790] Use pooled direct allocator for flusher's CompositeByteBuf
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
Before this PR, ```Flusher#takeBuffer``` returns a ```CompositeByteBuf``` which is unpooled and on heap:
```
      buffer = Unpooled.compositeBuffer(maxComponents)
```
```
    public static CompositeByteBuf compositeBuffer(int maxNumComponents) {
        return new CompositeByteBuf(ALLOC, /*direct*/ false, maxNumComponents);
    }
```
When consolidation happens, the data will be copied from direct memory to heap memory, causing OOM and
perf degration.
With this PR, in my test cases of shuffling 14G for three 1G/1G workers, I don't see disk buffer larger than direct memory,
nor do I encounter high GC.

### Does this PR introduce _any_ user-facing change?

This patch fixes some OOM issues.

### How was this patch tested?
Passes GA and manual test.

Closes #1709 from waitinfuture/790.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-11 22:44:57 +08:00
zky.zhoukeyong
3bc611bdd6 [CELEBORN-787] Add chunk related UTs for FileWriter
### What changes were proposed in this pull request?
Add chunk related UTs for FileWriter.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA.

Closes #1705 from waitinfuture/787.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 16:05:18 +08:00
caojiaqing
d64e0091f1 [CELEBORN-785] Add worker side partition hard split threshold
### What changes were proposed in this pull request?
Add a configuration `celeborn.worker.shuffle.partitionSplit.max`  to ensure that, in soft mode, individual partition files are limited to a size smaller than the configured value

### Why are the changes needed?

In soft mode, there may be situations where individual partition files are exceptionally large, which can result in excessively long sort times in skewed scenarios.

### Does this PR introduce _any_ user-facing change?
`celeborn.worker.shuffle.partitionSplit.max` defalut value 2g

### How was this patch tested?
none

Closes #1701 from JQ-Cao/785.

Authored-by: caojiaqing <caojiaqing@bilibili.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 14:14:41 +08:00
Shuang
02492fe9f8 [CELEBORN-626][FOLLOWUP] Fix wrong update chunkOffsets when final flush with empty buffer
### What changes were proposed in this pull request?
Final flush with none empty buffer.

### Why are the changes needed?
This bug was introduced by [CELEBORN-626](https://github.com/apache/incubator-celeborn/pull/1534/files)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass UT

Closes #1702 from RexXiong/CELEBORN-626-dev.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-11 00:28:54 +08:00
zky.zhoukeyong
334f0e5ba4 [CELEBORN-782] Make max components configurable for FileWriter#flushBuffer
### What changes were proposed in this pull request?
Make max components configurable for FileWriter#flushBuffer.

### Why are the changes needed?
When max components of ```CompositeByteBuf``` is too big (hard coded 256 before this PR), netty's offheap memory
usage will be several times bigger than true usage:
```
 Direct memory usage: 1044.0 MiB/4.0 GiB, disk buffer size: 255.9 MiB
```
When set to 1, netty's memory usage will be very close to disk buffer:
```
Direct memory usage: 376.0 MiB/4.0 GiB, disk buffer size: 353.0 MiB
```
but when it set too low, for example 1, performance might decrease, especially for sort pusher:
```
max components:   1 vs. 16
shuffle write time:   34s vs. 30s
```
For hash pusher, difference is not so big:
```
max components:   1 vs. 8
shuffle write time:   25s vs. 23s
```
This PR makes the parameter configurable.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA, and manual test.

Closes #1697 from waitinfuture/782.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-10 18:00:07 +08:00
mingji
a7ff28c26d [CELEBORN-779] Fix sorted file size summary overflow
### What changes were proposed in this pull request?
To fix sorted file size summary counter overflow. Sorted file size summary can be larger than the integer's max value. It should be applied to branch-0.3, branch-0.2, and main branch.

### Why are the changes needed?
This graph is incorrect.
<img width="1500" alt="lQLPJx4a6A76xo7NBOHNC7iwLBx50tXnv5IEoQTQxMAMAA_3000_1249" src="https://github.com/apache/incubator-celeborn/assets/4150993/8d288087-64d7-4569-ba04-05bb1915118a">

### Does this PR introduce _any_ user-facing change?
It will cause user-facing changes.  Celeborn worker's metric for sorted file summary is incorrect and this patch will correct it.

### How was this patch tested?
UT.

Closes #1694 from FMX/CELEBORN-779.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-10 11:29:47 +08:00
Cheng Pan
494c116818
[CELEBORN-778] Rename MemoryManagerStat to ServingState
### What changes were proposed in this pull request?

Rename

```
enum MemoryManagerStat { resumeAll, pausePushDataAndReplicate, pausePushDataAndResumeReplicate }
```

to

```
enum ServingState { NONE_PAUSED, PUSH_AND_REPLICATE_PAUSED, PUSH_PAUSED }
```

### Why are the changes needed?

`MemoryManagerStat` indicates the worker serving functionalities, and it's weird to represent the state using verbs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1691 from pan3793/CELEBORN-778.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-10 11:17:52 +08:00
Angerszhuuuu
fd82855e8b [CELEBORN-777][FOLLOWUP] CongestionControl getPotentialConsumeSpeed throw … …/zero error
### What changes were proposed in this pull request?
Fix compute bug

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1693 from AngersZhuuuu/CELEBORN-777.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-10 10:37:19 +08:00
Angerszhuuuu
52dcd3b5df [CELEBORN-777][BUG] CongestionControl getPotentialConsumeSpeed throw /zero error
### What changes were proposed in this pull request?
In `TimeSlidingHub.add()` `_deque` will clear then add the pair.

```
      if (nodesToAdd >= maxQueueSize) {
        // The new node exceed existing sliding list, need to clear all old nodes
        // and create a new sliding list
        _deque.clear();
        _deque.add(Pair.of(currentTimestamp, (N) newNode.clone()));
        sumNode = (N) newNode.clone();
        return;
      }

```

Then when call `BufferStatusHub.avgBytesPerSec()`,  `currentNumBytes` can be `> 0` but `getCurrentTimeWindowsInMills` may return 0. Cause the error.

```
  public long avgBytesPerSec() {
    long currentNumBytes = sum().numBytes();
    if (currentNumBytes > 0) {
      return currentNumBytes * 1000 / (long) getCurrentTimeWindowsInMills();
    }
    return 0L;
  }
```

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1690 from AngersZhuuuu/CELEBORN-777.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-08 21:46:37 +08:00
Cheng Pan
e314f85087
[CELEBORN-614][FOLLOWUP] Fix flushOnMemoryPressure condition
### What changes were proposed in this pull request?

Fix the refactor bug of CELEBORN-614 (https://github.com/apache/incubator-celeborn/pull/1517).

### Why are the changes needed?

This is a bug fix, the condition `writer.getException != null` was inverted accidentally during CELEBORN-614 (https://github.com/apache/incubator-celeborn/pull/1517), which causes the trim became no-op.

### Does this PR introduce _any_ user-facing change?

No. The bug was caused by an unreleased commit.

### How was this patch tested?

Set Worker off-heap memory to 2G, and run 1T tera sort.

Before: the trim does not trigger disk buffer flush, causing the worker can not to recover from the pause pushdata state, then Job failed.

After: the trim correctly triggers disk buffer flush, releases the worker memory, and the Job succeeded.

<img width="1653" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/9ef62c78-e6a9-497f-9dac-d3f712e830cc">

Closes #1689 from pan3793/CELEBORN-614-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-07-07 15:33:09 +08:00
mingji
d0ecf83fec [CELEBORN-764] Fix celeborn on HDFS might clean using app directories
### What changes were proposed in this pull request?
Make Celeborn leader clean expired app dirs on HDFS when an application is Lost.

### Why are the changes needed?
If Celeborn is working on HDFS, the storage manager starts and cleans expired app directories, and the newly created worker will want to delete any unknown app directories.
This will cause using app directories to be deleted unexpectedly.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1678 from FMX/CELEBORN-764.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-05 23:11:50 +08:00
Angerszhuuuu
693172d0bd [CELEBORN-751] Rename remain rss related class name and filenames etc
### What changes were proposed in this pull request?
Rename remain rss related class name and filenames etc...

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1664 from AngersZhuuuu/CELEBORN-751.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-07-04 10:20:08 +08:00
Fu Chen
3964861fd7
[CELEBORN-726][FOLLOWUP] Eliminate 'TODO' comments within the Controller class
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

eliminate comments introduced in https://github.com/apache/incubator-celeborn/pull/1650

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1672 from cfmcgrady/primary-replica-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-03 13:40:30 +08:00
Cheng Pan
a1be02b4fa [CELEBORN-757] Improve metrics method signature and code style
### What changes were proposed in this pull request?

- gauge method definition improvement. i.e.

  before
  ```
  def addGauge[T](name: String, f: Unit => T, labels: Map[String, String])
  ```
  after
  ```
  def addGauge[T](name: String, labels: Map[String, String])(f: () => T)
  ```
  which improves the caller-side code style
  ```
  addGauge(name, labels) { () =>
    ...
  }
  ```

- remove unnecessary Java/Scala collection conversion. Since Scala 2.11 does not support SAM, the extra implicit function is required.

- leverage Logging to defer message evaluation

- UPPER_CASE string constants

### Why are the changes needed?

Improve code quality and performance(maybe)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1670 from pan3793/CELEBORN-757.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-07-03 11:56:43 +08:00
xiyu.zk
381165d4e7
[CELEBORN-755] Support disable shuffle compression
### What changes were proposed in this pull request?
Support to decide whether to compress shuffle data through configuration.

### Why are the changes needed?
Currently, Celeborn compresses all shuffle data, but for example, the shuffle data of Gluten has already been compressed. In this case, no additional compression is required. Therefore, configuration needs to be provided for users to decide whether to use Celeborn’s compression according to the actual situation.

### Does this PR introduce _any_ user-facing change?
no.

Closes #1669 from kerwin-zk/celeborn-755.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-07-01 00:03:50 +08:00
Fu Chen
047e90b17b
[CELEBORN-756][WORKER] Refactor PushDataHandler class to utilize while loop
### What changes were proposed in this pull request?

as title

### Why are the changes needed?

per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex, use `while` loop for performance-sensitive code

worker's flame graph before:

![截屏2023-06-30 下午5 58 02](https://github.com/apache/incubator-celeborn/assets/8537877/28c199b6-a29b-4501-8064-e0f2ddb2a8b9)

after:

![截屏2023-06-30 下午5 58 18](https://github.com/apache/incubator-celeborn/assets/8537877/c6134959-5f78-436b-aa29-a78882b09e84)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1668 from cfmcgrady/while-loop-2.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-30 18:13:53 +08:00
Fu Chen
baa0d0b3b4
[CELEBORN-750] Simplify get Netty used direct memory logic
### What changes were proposed in this pull request?

netty has exposed the public API `PlatformDependent.usedDirectMemory()` to get netty used direct memory since [netty-4.1.35.Final](https://github.com/netty/netty/pull/8945), simplifies the logic

### Why are the changes needed?

simplifies the get netty used direct memory logic

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1662 from cfmcgrady/netty-used-memory.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-29 18:24:16 +08:00
Fu Chen
adbd38a926
[CELEBORN-726][FOLLOWUP] Update data replication terminology from master/slave to primary/replica in the codebase
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #1639 from cfmcgrady/primary-replica.

Lead-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 17:07:26 +08:00
Angerszhuuuu
1fd8833756
[CELEBORN-748] Rename RssHARetryClient to MasterClient
### What changes were proposed in this pull request?

Rename RssHARetryClient to MasterClient

### Why are the changes needed?

Code refactor

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1661 from AngersZhuuuu/CELEBORN-748.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 16:47:15 +08:00
Cheng Pan
11569689be
[CELEBORN-747] Print time/bytes in human-readable format
### What changes were proposed in this pull request?

Print time/bytes in human-readable format

### Why are the changes needed?

Make logs readable

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1659 from pan3793/minor.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 16:46:12 +08:00
Angerszhuuuu
1a53db22ce
[CELEBORN-739] Rename HeartbeatResponse to HeartbeatFromWorkerResponse
### What changes were proposed in this pull request?
Rename HeartbeatResponse to HeartbeatFromWorkerResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1651 from AngersZhuuuu/CELEBORN-739.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 13:08:08 +08:00
Fu Chen
17c1e01874
[CELEBORN-726] Update data replication terminology from master/slave to primary/replica for configurations and metrics
### What changes were proposed in this pull request?

This pull PR is an integral component of #1639 . It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC.

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests.

Closes #1650 from cfmcgrady/primary-replica-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-29 09:47:02 +08:00
Angerszhuuuu
4c4e18b0d6 [CELEBORN-735] Remove unused RPC GetWorkerInfo & GetWorkerInfosResponse
### What changes were proposed in this pull request?
Remove unused RPC GetWorkerInfo & GetWorkerInfosResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1647 from AngersZhuuuu/CELEBORN-735.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-28 20:17:56 +08:00
Angerszhuuuu
ad13b04f2e [CELEBORN-732] Remove unused RPC ThreadDump & ThreadDumpResponse
### What changes were proposed in this pull request?
Remove unused RPC ThreadDump & ThreadDumpResponse

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1645 from AngersZhuuuu/CELEBORN-732.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 19:43:39 +08:00
onebox-li
1b74d85fb1 [CELEBORN-725][MINOR] Refine congestion code
### What changes were proposed in this pull request?
Refine the congestion relevant code/log/comments

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
manually test

Closes #1637 from onebox-li/improve-congestion.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 18:31:40 +08:00
Cheng Pan
b821349c4a
[CELEBORN-727][TEST] Fix flaky test RssHashCheckDiskSuite
### What changes were proposed in this pull request?

Fix the flaky test by enlarging `celeborn.client.shuffle.expired.checkInterval`

### Why are the changes needed?

```
RssHashCheckDiskSuite:
- celeborn spark integration test - hash-checkDiskFull *** FAILED ***
  868 was not less than 0 (RssHashCheckDiskSuite.scala:83)
```

https://github.com/apache/incubator-celeborn/actions/runs/5396767745/jobs/9800766633

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA, and should observe CI,

Closes #1640 from pan3793/CELEBORN-727.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-28 17:59:54 +08:00
Demon Liang
a1199a9895 [CELEBORN-728] Celeborn won't clean remnant application directory on HDFS if worker is restarted
### What changes were proposed in this pull request?
To clean the remnant application directory after Celeborn Worker is restarted.

### Why are the changes needed?
Remnant application directories will not be deleted, because `hadoopFs.listFiles(path,false)` will not list directories.

### Does this PR introduce _any_ user-facing change?
No.

Closes #1641 from Demon-Liang/0.3-dev.

Authored-by: Demon Liang <liangdingwen.ldw@alipay.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
(cherry picked from commit 42a9160c8ceaf79bae514c54dafcb5b8e12d5251)
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 17:54:08 +08:00
Angerszhuuuu
3985a5cbd7 [CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment
### What changes were proposed in this pull request?
Unify all blacklist related code and comment

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-28 16:28:03 +08:00
zky.zhoukeyong
57b0e815cf [CELEBORN-656] Batch revive RPCs in client to avoid too many requests
### What changes were proposed in this pull request?
This PR batches revive requests and periodically send to LifecycleManager to reduce number or RPC requests.

To be more detailed. This PR changes Revive message to support multiple unique partitions, and also passes a set unique mapIds for checking MapEnd. Each time ShuffleClientImpl wants to revive, it adds a ReviveRquest to ReviveManager and wait for result. ReviveManager batches revive requests and periodically send to LifecycleManager (deduplicated by partitionId). LifecycleManager constructs ChangeLocationsCallContext and after all locations are notified, it replies to ShuffleClientImpl.

### Why are the changes needed?
In my test 3T TPCDS q23a with 3 Celeborn workers, when kill a worker, the LifecycleManger will receive 4.8w Revive requests:
```
[emr-usermaster-1-1 logs]$ cat spark-emr-user-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master-1-1.c-fa08904e94c028d1.out.1 |grep -i revive |wc -l
64364
```
After this PR, number of ReviveBatch requests reduces to 708:
```
[emr-usermaster-1-1 logs]$ cat spark-emr-user-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master-1-1.c-fa08904e94c028d1.out |grep -i revive |wc -l
2573
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test. I have tested:

1. Disable graceful shutdown, kill one worker, job succeeds
2. Disable graceful shutdown, kill two workers successively, job fails as expected
3. Enable graceful shutdown, restart two workers successively, job succeeds
4. Enable graceful shutdown, restart two workers successively, then kill the third one, job succeeds

Closes #1588 from waitinfuture/656-2.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-06-27 22:11:04 +08:00
mingji
40760ede3a [CELEBORN-568] Support storage type selection
### What changes were proposed in this pull request?
1. Celeborn supports storage type selection. HDD, SSD, and HDFS are available for now.
2. Add new buffer size for HDFS file writers.
3. Worker support empty working dirs.

### Why are the changes needed?
Support HDFS only scenario.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1619 from FMX/CELEBORN-568.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-27 18:07:08 +08:00
Angerszhuuuu
a2b215bd47 [CELEBORN-718] Support override Hadoop Conf by Celeborn Conf with celeborn.hadoop. prefix
### What changes were proposed in this pull request?
 Celeborn generate hadoop configuration should respect Celeborn conf

### Why are the changes needed?

In spark client side we should write like `spark.celeborn.hadoop.xxx.xx`
In server side we should write like `celeborn.hadoop.xxx.xxx`

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1629 from AngersZhuuuu/CELEBORN-719.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2023-06-27 17:00:47 +08:00
Cheng Pan
1753556565
[CELEBORN-713] Local network binding support IP or FQDN
### What changes were proposed in this pull request?

This PR aims to make network local address binding support both IP and FQDN strategy.

Additional, it refactors the `ShuffleClientImpl#genAddressPair`, from `${hostAndPort}-${hostAndPort}` to `Pair<String, String>`, which works properly when using IP but may not on FQDN because FQDN may contain `-`

### Why are the changes needed?

Currently, when the bind hostname is not set explicitly, Celeborn will find the first non-loopback address and always uses the IP to bind, this is not suitable for K8s cases, as the STS has a stable FQDN but Pod IP will be changed once Pod restarting.

For `ShuffleClientImpl#genAddressPair`, it must be changed otherwise may cause

```
java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11657 in stage 0.0 failed 4 times, most recent failure: Lost task 11657.3 in stage 0.0 (TID 12747) (10.153.253.198 executor 157): java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.celeborn.client.ShuffleClientImpl.doPushMergedData(ShuffleClientImpl.java:874)
	at org.apache.celeborn.client.ShuffleClientImpl.pushOrMergeData(ShuffleClientImpl.java:735)
	at org.apache.celeborn.client.ShuffleClientImpl.mergeData(ShuffleClientImpl.java:827)
	at org.apache.spark.shuffle.celeborn.SortBasedPusher.pushData(SortBasedPusher.java:140)
	at org.apache.spark.shuffle.celeborn.SortBasedPusher.insertRecord(SortBasedPusher.java:192)
	at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.fastWrite0(SortBasedShuffleWriter.java:192)
	at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.write(SortBasedShuffleWriter.java:145)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1508)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```

### Does this PR introduce _any_ user-facing change?

Yes, a new configuration `celeborn.network.bind.preferIpAddress` is introduced, and the default value is `true` to preserve the existing behavior.

### How was this patch tested?

Manually testing with `celeborn.network.bind.preferIpAddress=false`
```
Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-0.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.143.252

Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-1.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.173.94

Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-2.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.149.42

starting org.apache.celeborn.service.deploy.worker.Worker, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.worker.Worker-1-celeborn-worker-4.out
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.Dispatcher#51 - Dispatcher numThreads: 4
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.network.client.TransportClientFactory#91 - mode NIO threads 64
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.NettyRpcEnvFactory#51 - Starting RPC Server [WorkerSys] on celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 with advisor endpoint celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.util.Utils#51 - Successfully started service 'WorkerSys' on port 38303.
```

Closes #1622 from pan3793/CELEBORN-713.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-27 09:42:11 +08:00
Cheng Pan
2b82194ce0 [CELEBORN-715] Change master URL schema from rss to celeborn
### What changes were proposed in this pull request?

Change Celeborn Master URL from `rss://<host>:<port>` to `celeborn://<host>:<port>`

### Why are the changes needed?

Respect the project name.

### Does this PR introduce _any_ user-facing change?

Yes, migration guide is updated accordingly.

### How was this patch tested?

Pass GA.

Closes #1624 from pan3793/CELEBORN-715.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-26 22:30:20 +08:00
Fu Chen
1b3ec61690 [CELEBORN-711][TEST] Rework PushDataTimeoutTest
### What changes were proposed in this pull request?

1. separated push data timeout tests and push merge data timeout tests in `PushDataTimeoutTest`
2. updated the test results assertion
3. rework `pushdata timeout will add to blacklist`

### Why are the changes needed?

ensure that the timeout behavior is correctly implemented

https://github.com/apache/incubator-celeborn/pull/1613#discussion_r1236423721

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #1620 from cfmcgrady/push-timeout-test.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-26 13:45:27 +08:00
zky.zhoukeyong
6b82ecdfa0 [CELEBORN-712] Make appUniqueId a member of ShuffleClientImpl and refactor code
### What changes were proposed in this pull request?
Make appUniqueId a member of ShuffleClientImpl and remove applicationId from RPC messages across client side, so it won't cause compatibility issues.

### Why are the changes needed?
Currently Celeborn Client is bound to a single application id, so there's no need to pass applicationId around in many RPC messages in client side.

### Does this PR introduce _any_ user-facing change?
In some logs the application id will not be printed, which should not be a problem.

### How was this patch tested?
UTs.

Closes #1621 from waitinfuture/appid.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-25 21:37:16 +08:00
Cheng Pan
679f9cbf58 [CELEBORN-708] Fix commit metrics in application heartbeat
### What changes were proposed in this pull request?

- Fix commit metrics in application heartbeat
- Change client side application heartbeat message log level to info
- Improve heartbeat log by unify the word "heartbeat"

### Why are the changes needed?

`commitHandler.commitMetrics()` has side effects, multiple calls to get values independently is incorrect.

```
def commitMetrics(): (Long, Long) = (totalWritten.sumThenReset(), fileCount.sumThenReset())
```

### Does this PR introduce _any_ user-facing change?

Yes, bug fix.

### How was this patch tested?

Review.

Closes #1617 from pan3793/CELEBORN-708.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-21 22:34:24 +08:00
Fu Chen
c6113c10e5 [CELEBORN-703][WORKER][PERF] Avoid calling CelebornConf#get multi-time when PushDataHandler handle PushData/PushMergedData
### What changes were proposed in this pull request?

As title.

the worker's frame graph before:

![image](https://github.com/apache/incubator-celeborn/assets/8537877/68a0a1fd-34c2-4618-9146-a2d66c951645)

the worker's frame graph after:

![image](https://github.com/apache/incubator-celeborn/assets/8537877/268e8109-737d-4440-b2b8-60687ef090cb)

### Why are the changes needed?

improve the worker's perf.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

existing UT and manually tested.

Closes #1613 from cfmcgrady/push-data-handler-conf.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-21 20:02:44 +08:00
zky.zhoukeyong
6ca2bb2c6f
[CELEBORN-701][COMPATIBILITY] Fix compatibility issue caused by pushdata timeout
### What changes were proposed in this pull request?
Fixes compatibility issue caused by change of push data timeout.

### Why are the changes needed?
When I test with branch-0.2 client with main server side, I got the following error:
```
23/06/20 17:42:34,538 ERROR [push-timeout-checker-12] PushDataHandler: PushData replication failed for partitionLocation: PartitionLocation[
  id-epoch:767-0
  host-rpcPort-pushPort-fetchPort-replicatePort:192.168.1.17-37687-42605-42891-41319
  mode:MASTER
  peer:(host-rpcPort-pushPort-fetchPort-replicatePort:192.168.1.18-45201-35749-41831-46853)
  storage hint:StorageInfo{type=MEMORY, mountPoint='/mnt/disk4', finalResult=false, filePath=}
  mapIdBitMap:null]
org.apache.celeborn.common.exception.CelebornIOException: PUSH_DATA_TIMEOUT_SLAVE
        at org.apache.celeborn.common.network.client.TransportResponseHandler.failExpiredPushRequest(TransportResponseHandler.java:125)
        at org.apache.celeborn.common.network.client.TransportResponseHandler.lambda$new$0(TransportResponseHandler.java:96)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
```
The error is because in main branch ReserveSlots added a new field ```pushDataTimeout```, so when client is branch-0.2, the default value will be 0, and always trigger timeout.

### Does this PR introduce _any_ user-facing change?
Yes, this PR fixes a bug when client is branch-0.2 and server is main.

### How was this patch tested?
Manual test.

Closes #1611 from waitinfuture/701.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-20 18:43:02 +08:00
zky.zhoukeyong
255661bbb7 [CELEBORN-696] Fix bugs related with shutting down and excluded workers
### What changes were proposed in this pull request?
1. Foreach PartitionLocation returned from Register Shuffle, remove from worker's local excluded list to refresh the local information.
2. ChangeLocationResponse will also return whether oldPartition is excluded in LifecycleManager. If so, remove it from Executor's excluded list.
3. Always trigger commit files for shutting down workers returned from HeartbeatFromApplicationResponse.
4. HeartbeatFromWorker sends the correct disk infos regardless of shutting down or not.

After this PR, the priority of excluded list is Master > LifecycleManager > Executor.

### Why are the changes needed?
During test with graceful turned on(workers have static rpc/push/fetch/replicate ports) and consistently restart one out of three workers, I encountered several bugs.
1. First I killed worker A, then Executor's client's local excluded list will contain A, after A stopped, I started it again, then master will offer slots on A, so we should remove from the executor's excluded list then.
2. When I kill-and-start a worker twice in a short time smaller than the app heartbeat interval, the second time WorkerStatusTracker will not trigger commit files because the local cache for the worker has not been refreshed.
3. When a worker is shutting down, in its heartbeat it passes empty diskInfos, and master blindly added to excluded list. We want a worker be either in the excluded list, or in the shutting down list, exclusively. If a worker is in excluded list, then LIfecycleManager will not trigger commit files when handle heartbeat response; on the other hand, if a worker is in the shutting down list, LifecycleManager will trigger commit files on it. So we must make it correct that a shutting down worker be in the shutting down list.

### Does this PR introduce _any_ user-facing change?
Yes, it fixes several bugs described above.

### How was this patch tested?
Manual test.

Closes #1606 from waitinfuture/696.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-06-20 16:47:07 +08:00
onebox-li
88586d6c15 [CELEBORN-697] Fix assignment of DeviceInfo deviceStatAvailable
### What changes were proposed in this pull request?
DeviceInfo `deviceStatAvailable` 's variable name and assignment does not match

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Cluster test

Closes #1607 from onebox-li/fix-deviceStatAvailable.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-20 15:53:15 +08:00
onebox-li
3af81057e9 [CELEBORN-698] Fix LocalDeviceMonitor::readWriteError judge
### What changes were proposed in this pull request?
If dataDir does not exists, it will skip check in DeviceMonitor::readWriteError.
It will cause that a disk found READ_OR_WRITE_FAILURE when creating FileWriter error, may change to be healthy in the subsequent disk checker.
```
23/06/19 19:09:52,718 ERROR [dispatcher-event-loop-16] StorageManager: Create FileWriter for /data7/rss_storage/rss-worker/shuffle_data/application_1684305681931_1943199/180/79-0-0 of mount /data7 failed, report to DeviceMonitor
java.io.IOException: No such file or directory
        at java.io.UnixFileSystem.createFileExclusively(Native Method)
        at java.io.File.createNewFile(File.java:1012)
        at org.apache.celeborn.service.deploy.worker.storage.StorageManager.createWriter(StorageManager.scala:330)
        ...
        at java.lang.Thread.run(Thread.java:745)
23/06/19 19:09:52,718 ERROR [dispatcher-event-loop-16] LocalDeviceMonitor: Receive non-critical exception, disk: /data7, java.io.IOException: java.io.IOException: No such file or directory

updated state
DiskInfo(maxSlots: 0, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /data7, usableSpace: 536870912000, avgFlushTime: 0, avgFetchTime: 0, activeSlots: 0) status: READ_OR_WRITE_FAILURE dirs /data7/rss_storage/rss-worker/shuffle_data

after disk checker, updated stated
DiskInfo(maxSlots: 0, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /data7, usableSpace: 536870912000, avgFlushTime: 0, avgFetchTime: 0, activeSlots: 0) status: HEALTHY dirs /data7/rss_storage/rss-worker/shuffle_data
```

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster test

Closes #1608 from onebox-li/fix-disk-check.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-20 15:51:53 +08:00
Cheng Pan
85be99548a
[CELEBORN-685][WORKER][HDFS] Fix permission on creating shuffle dir on HDFS
### What changes were proposed in this pull request?

Correct the FsPermission 755.

### Why are the changes needed?

We should use octal 0755 or "755" instead of decimal to represent UNIX permission. String is chosen because octal is deprecated in Scala.

### Does this PR introduce _any_ user-facing change?

Yes, it's a bug fix.

### How was this patch tested?

Manually review.

Closes #1597 from pan3793/CELEBORN-685.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-06-20 15:34:24 +08:00
zky.zhoukeyong
7d634db547 [CELEBORN-695] Fix UnsupportedOperationException by refactoring WorkerInfo
### What changes were proposed in this pull request?
Refactor WorkerInfo

1. make ```diskInfos```, ```userResourceConsumption``` new maps instead of using the passed in reference
2. remove ```endpoint``` from the constructor

### Why are the changes needed?

When manually test stop-worker.sh with graceful turned on, I got the following Exception
```
23/06/19 11:04:25,665 INFO [worker-forward-message-scheduler] RssHARetryClient: connect to master master-1-1:9097.
23/06/19 11:04:27,168 ERROR [worker-forward-message-scheduler] RssHARetryClient: Send rpc with failure, has tried 15, max try 15!
org.apache.celeborn.common.exception.CelebornException: Exception thrown in awaitResult:
        at org.apache.celeborn.common.util.ThreadUtils$.awaitResult(ThreadUtils.scala:231)
        at org.apache.celeborn.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:74)
        at org.apache.celeborn.common.haclient.RssHARetryClient.sendMessageInner(RssHARetryClient.java:150)
        at org.apache.celeborn.common.haclient.RssHARetryClient.askSync(RssHARetryClient.java:118)
        at org.apache.celeborn.service.deploy.worker.Worker.org$apache$celeborn$service$deploy$worker$Worker$$heartBeatToMaster(Worker.scala:306)
        at org.apache.celeborn.service.deploy.worker.Worker$$anon$1.$anonfun$run$1(Worker.scala:332)
        at org.apache.celeborn.common.util.Utils$.tryLogNonFatalError(Utils.scala:186)
        at org.apache.celeborn.service.deploy.worker.Worker$$anon$1.run(Worker.scala:332)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.celeborn.common.exception.CelebornIOException: remove
        at org.apache.celeborn.service.deploy.master.clustermeta.ha.HAHelper.sendFailure(HAHelper.java:65)
        at org.apache.celeborn.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:210)
        at org.apache.celeborn.service.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:315)
        at org.apache.celeborn.common.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
        at org.apache.celeborn.common.rpc.netty.Inbox.safelyCall(Inbox.scala:222)
        at org.apache.celeborn.common.rpc.netty.Inbox.process(Inbox.scala:110)
        at org.apache.celeborn.common.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:229)
        ... 3 more
Caused by: java.lang.UnsupportedOperationException: remove
        at scala.collection.convert.Wrappers$MapWrapper$$anon$2$$anon$3.remove(Wrappers.scala:236)
        at java.util.AbstractMap.remove(AbstractMap.java:254)
        at org.apache.celeborn.common.meta.WorkerInfo.$anonfun$updateThenGetDiskInfos$2(WorkerInfo.scala:225)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at org.apache.celeborn.common.meta.WorkerInfo.updateThenGetDiskInfos(WorkerInfo.scala:224)
        at org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager.lambda$updateWorkerHeartbeatMeta$5(AbstractMetaManager.java:205)
        at java.util.Optional.ifPresent(Optional.java:159)
        at org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager.updateWorkerHeartbeatMeta(AbstractMetaManager.java:203)
        at org.apache.celeborn.service.deploy.master.clustermeta.SingleMasterMetaManager.handleWorkerHeartbeat(SingleMasterMetaManager.java:105)
        at org.apache.celeborn.service.deploy.master.Master.org$apache$celeborn$service$deploy$master$Master$$handleHeartbeatFromWorker(Master.scala:428)
        at org.apache.celeborn.service.deploy.master.Master$$anonfun$receiveAndReply$1.$anonfun$applyOrElse$20(Master.scala:326)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.celeborn.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:207)
        ... 8 more
```

According to the suggestion from https://github.com/apache/incubator-celeborn/pull/1602#issuecomment-1596722991

### Does this PR introduce _any_ user-facing change?
Yes, it fixes bug described  in https://github.com/apache/incubator-celeborn/pull/1602

### How was this patch tested?
UTs and manual test.

Closes #1605 from waitinfuture/695.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-19 19:38:55 +08:00
sychen
e734ceb558 [MINOR] Cleanup code
### What changes were proposed in this pull request?
1. Use `<arg>-Ywarn-unused-import</arg>` to remove some unused imports
There is no way to use `<arg>-Ywarn-unused-import</arg>` at this stage
Because we have the following code
```
// Can Remove this if celeborn don't support scala211 in future
import org.apache.celeborn.common.util.FunctionConverter._
```
2. Fix scala case match not fully covered, avoid `scala.MatchError`
3. Fixed some scala compilation warnings

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1600 from cxzl25/cleanup_code.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-06-19 11:31:51 +08:00