232 Commits

Author SHA1 Message Date
SteNicholas
dfb18072e2 [CELEBORN-932][FOLLOWUP] Remove StatusSystem#handleWorkerRemove from RegisterWorker to avoid duplicated behavior in RegisterWorker
### What changes were proposed in this pull request?

Remove `StatusSystem#handleWorkerRemove` from `RegisterWorker` to avoid duplicated behavior in `RegisterWorker`.

### Why are the changes needed?

`RegisterWorker` has already been improved to cover the behavior of `StatusSystem#handleWorkerRemove`. Therefore, `StatusSystem#handleWorkerRemove` should be removed from `RegisterWorker` to avoid duplicated behavior.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2731 from SteNicholas/CELEBORN-932.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-18 14:53:04 +08:00
SteNicholas
baef31abb8 [CELEBORN-1477][FOLLOWUP] /api/v1/workers/events should support None eventType to align /sendWorkerEvent
### What changes were proposed in this pull request?

`/api/v1/workers/events` should support the `None` `eventType` to align with `/sendWorkerEvent`.

### Why are the changes needed?

The legal event types of `/sendWorkerEvent` are `None`, `Immediately`, `Decommission`, `DecommissionThenIdle`, `Graceful`, and `Recommission`, but `/api/v1/workers/events` does not accept `None` as an `eventType`.
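
A minimal sketch of accepting `None` alongside the other event types. The enum values are taken from the list above; the class name `WorkerEventTypeSketch` and the `parse` helper are hypothetical and are not Celeborn's actual API.

```java
import java.util.Arrays;
import java.util.Optional;

final class WorkerEventTypeSketch {
  // Event types as listed in the description; the enum itself is illustrative.
  enum EventType { None, Immediately, Decommission, DecommissionThenIdle, Graceful, Recommission }

  // Returns the matching event type, or empty when the string is not a legal type.
  public static Optional<EventType> parse(String raw) {
    return Arrays.stream(EventType.values())
        .filter(t -> t.name().equals(raw))
        .findFirst();
  }

  public static void main(String[] args) {
    System.out.println("None accepted: " + parse("None").isPresent());
    System.out.println("Bogus accepted: " + parse("Bogus").isPresent());
  }
}
```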

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`ApiV1MasterResourceSuite#worker resource`

Closes #2732 from SteNicholas/CELEBORN-1477.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-09-14 11:14:23 +08:00
Aravind Patnam
5f02f3e8f1 [CELEBORN-1589] Ensure master is leader for some POST request APIs
### What changes were proposed in this pull request?
Ensure that `excludeWorker`, `removeWorkersUnavailableInfo`, and `sendWorkerEvents` can only be executed on the Master leader node.
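
The guard can be pictured roughly as follows. This is only a sketch: `LeaderOnlyGuardSketch`, `runIfLeader`, and the rejection message are hypothetical names chosen for illustration, and the leader check is passed in as a plain `BooleanSupplier` rather than taken from Celeborn's HA code.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

final class LeaderOnlyGuardSketch {
  // Runs the mutating action only when this node currently reports itself as leader.
  public static String runIfLeader(BooleanSupplier isLeader, Supplier<String> action) {
    if (!isLeader.getAsBoolean()) {
      // Hypothetical rejection message; a real server would return an HTTP error.
      return "rejected: not the master leader";
    }
    return action.get();
  }

  public static void main(String[] args) {
    System.out.println(runIfLeader(() -> true, () -> "worker excluded"));
    System.out.println(runIfLeader(() -> false, () -> "worker excluded"));
  }
}
```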

### Why are the changes needed?
To prevent inconsistencies between master peers.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
tested against a cluster

Closes #2730 from akpatnam25/CELEBORN-1589.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-09-12 15:31:43 +08:00
Aravind Patnam
f06dba14b3 [CELEBORN-1513] Support wildcard bind in dual stack environments
### What changes were proposed in this pull request?
Support wildcard bind for RPC and HTTP servers. When the wildcard address is used, the service can listen for both IPv4 and IPv6 traffic in dual-stack environments.

The specific scenario where this becomes relevant is as follows:

If some of the compute infrastructure is IPv4-only, some IPv6-only, and the rest dual-stack, a single Celeborn deployment can serve all of them by:
a) Setting bind.preferip to false, so that the advertised address is the host and not the IP.

b) Binding to the wildcard address.

With both in place, the IPv4-only compute nodes will resolve the IPv4 address and connect to the IPv4 ip/port.
Likewise for IPv6-only nodes.
Dual-stack compute nodes will use the prefer-IPv6 Java flag to resolve to either IPv4 or IPv6.
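
For illustration only, a wildcard bind in plain Java looks like the sketch below. `WildcardBindSketch` and `bindsToWildcard` are hypothetical names, and this is not the Celeborn RPC/HTTP server code. `new InetSocketAddress(port)` without a host selects the wildcard address; whether the resulting socket accepts both IPv4 and IPv6 traffic depends on the OS and JVM being dual-stack.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

final class WildcardBindSketch {
  // Binds a server socket to the wildcard address and reports whether the
  // bound local address really is the wildcard ("any local") address.
  public static boolean bindsToWildcard(int port) {
    try (ServerSocket server = new ServerSocket()) {
      server.bind(new InetSocketAddress(port)); // no host given: wildcard address
      return server.getInetAddress().isAnyLocalAddress();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Port 0 lets the OS pick a free port.
    System.out.println("bound to wildcard: " + bindsToWildcard(0));
  }
}
```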

This is how we are handling the combination of scenarios internally.

### Why are the changes needed?
see above.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested on a server using netstat, and connected via `nc -4` and `nc -6` to confirm that both kinds of connection were accepted.

Closes #2713 from akpatnam25/CELEBORN-1513-fix.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-09-06 16:47:42 +08:00
szt
59a39952dd [CELEBORN-1586] Add available workers Metrics
### What changes were proposed in this pull request?
Currently the master service metrics cover workers, excludedWorkers, and other metadata, but not available workers. This PR adds that metric.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Local test
![image](https://github.com/user-attachments/assets/240c176c-4eef-4e3c-b34d-802291714702)

Closes #2723 from zaynt4606/availableWorker.

Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-09-05 13:34:52 +08:00
sychen
3ee672e15d [CELEBORN-1565] Introduce warn-unused-import in Scala
### What changes were proposed in this pull request?
This PR aims to introduce `warn-unused-import` in Scala.

### Why are the changes needed?
There are currently many invalid (unused) imports, which can be detected using `-Ywarn-unused-import`.
With the `silencer` plugin we can suppress the warning for some imports that are still required in Scala 2.11.

```scala
import org.apache.celeborn.common.util.FunctionConverter._
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2689 from cxzl25/CELEBORN-1565.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shaoyun Chen <csy@apache.org>
2024-08-29 13:43:17 +08:00
mingji
cdb5d21d12 [CELEBORN-1581] Fix incorrect metrics of DeviceCelebornFreeBytes and DeviceCelebornTotalBytes
### What changes were proposed in this pull request?
Fix incorrect metrics.

### Why are the changes needed?
The variable `statusSystem.workers` is a set, which caused the metrics output to be incorrect.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA.

<img width="729" alt="Screenshot 2024-08-27 16 48 35" src="https://github.com/user-attachments/assets/c221b680-917a-4207-91fc-c142fac64b65">

Closes #2708 from FMX/b1581.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-27 20:38:36 +08:00
Bowen Liang
4844c82519 [CELEBORN-1560] Remove usages of deprecated Files.createTempDir of Guava
### What changes were proposed in this pull request?

### Why are the changes needed?

`com.google.common.io.Files#createTempDir` has long been deprecated.
`java.nio.file.Files#createTempDirectory` should be used instead, as suggested in Guava's API Javadoc (https://guava.dev/releases/33.1.0-jre/api/docs/com/google/common/io/Files.html).
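
A minimal sketch of the replacement call. The wrapper `TempDirSketch.newTempDir` and the `celeborn-` prefix are hypothetical; only `java.nio.file.Files#createTempDirectory` itself comes from the JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempDirSketch {
  // Creates a fresh temporary directory with the given name prefix under the
  // default temp location, instead of Guava's deprecated Files.createTempDir().
  public static Path newTempDir(String prefix) {
    try {
      return Files.createTempDirectory(prefix);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = newTempDir("celeborn-");
    System.out.println("created " + dir);
    Files.delete(dir); // clean up the empty directory
  }
}
```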

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2680 from bowenliang123/files-temp-dir.

Authored-by: Bowen Liang <liangbowen@gf.com.cn>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-26 14:43:27 +08:00
szt
9f0af3456a [CELEBORN-1564] Fix actualUsableSpace of offerSlotsLoadAware condition on diskInfo
### What changes were proposed in this pull request?
Fix the `actualUsableSpace` condition on `diskInfo` in `offerSlotsLoadAware`:
`diskReserveSize` is now taken into account when `updateDiskInfos` runs in `StorageManager`,
so the master doesn't need to calculate `usableSpace` when offering slots.
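
As a rough illustration of accounting for the reserve on the worker side, the sketch below subtracts a reserved size from the free bytes before reporting usable space. The formula, the clamp at zero, and the names `UsableSpaceSketch` and `actualUsableSpace` are assumptions made for this example and may differ from what `StorageManager` does.

```java
final class UsableSpaceSketch {
  // Assumed rule: usable space is free space minus the reserved size, never negative.
  public static long actualUsableSpace(long freeBytes, long reserveBytes) {
    return Math.max(0L, freeBytes - reserveBytes);
  }

  public static void main(String[] args) {
    long gib = 1024L * 1024L * 1024L;
    System.out.println(actualUsableSpace(100 * gib, 5 * gib) / gib + " GiB usable");
    System.out.println(actualUsableSpace(3 * gib, 5 * gib) / gib + " GiB usable");
  }
}
```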

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #2688 from zaynt4606/main.

Authored-by: szt <zaynt4606@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-26 14:17:55 +08:00
SteNicholas
b330b550ba [CELEBORN-1557] Fix totalSpace of DiskInfo for Master in HA mode
### What changes were proposed in this pull request?

Fix `totalSpace` of `DiskInfo` for Master in HA mode.

### Why are the changes needed?

The `totalSpace` of `DiskInfo` is not synced for Master in HA mode, which makes the `totalSpace` incorrect.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`RatisMasterStatusSystemSuiteJ#testHandleRegisterWorker`

Closes #2690 from SteNicholas/CELEBORN-1557.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-08-19 16:18:17 +08:00
Wang, Fei
ae41cb5ade [CELEBORN-1537] Support to remove workers unavailable info with RESTful api
### What changes were proposed in this pull request?
In [CELEBORN-1535](https://issues.apache.org/jira/browse/CELEBORN-1535), we added support for disabling master workerUnavailableInfo expiration.

This PR introduces a new REST API for manually removing unavailable workers' info, so that it can be used on demand.

### Why are the changes needed?
To clean up the workers' unavailable info manually on demand when the expiration is disabled.

### Does this PR introduce _any_ user-facing change?
Yes, a new RESTful API.

### How was this patch tested?
UT.

Closes #2658 from turboFei/support_cleanup.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-19 11:10:33 +08:00
Aravind Patnam
74423fbd6d [CELEBORN-1549] Fix networkLocation persistence into Ratis
### What changes were proposed in this pull request?
Fixing a bug where the `networkLocation` is not persisted in Ratis, and the master defaults to `DEFAULT_RACK` when it loads the snapshot. This was missed in https://github.com/apache/celeborn/pull/2367 unfortunately, and it came up during our stress testing internally.

### Why are the changes needed?
Needed for custom network aware replication, so that networkLocation state is kept in snapshot file.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Updated unit test to ensure serde is correct.

Closes #2669 from akpatnam25/CELEBORN-1549.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-08-14 13:55:19 +08:00
zhaohehuhu
960ba2406f [CELEBORN-1531] Refactor self checks in master
### What changes were proposed in this pull request?
as title

### Why are the changes needed?

Add a `scheduleCheckTask` method to refactor some code.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Closes #2653 from zhaohehuhu/dev-0731.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-08-13 10:29:35 +08:00
zhaohehuhu
59b88beb62 [CELEBORN-1529] Read shuffle data from S3
### What changes were proposed in this pull request?
as title

### Why are the changes needed?

The change aims to make Celeborn read shuffle data from S3

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
Yes

Closes #2651 from zhaohehuhu/dev-0726.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-08-12 14:53:14 +08:00
Wang, Fei
91058acfda [CELEBORN-1542] Master supports to check the worker host pattern on worker registration
### What changes were proposed in this pull request?
This PR introduces an optional config item for a worker host pattern, and the master checks whether the worker host matches the pattern when a worker registers.

If it does not match, the register worker request will be rejected.
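
The check can be sketched as below. The class `WorkerHostPatternSketch`, the method `allowRegistration`, and the example pattern are hypothetical; the sketch assumes that an unset pattern allows every host, since the config item is optional.

```java
import java.util.Optional;
import java.util.regex.Pattern;

final class WorkerHostPatternSketch {
  // Accepts the worker when no pattern is configured or when the whole host
  // name matches the configured pattern.
  public static boolean allowRegistration(Optional<Pattern> allowedHosts, String host) {
    return allowedHosts.map(p -> p.matcher(host).matches()).orElse(true);
  }

  public static void main(String[] args) {
    Optional<Pattern> pattern = Optional.of(Pattern.compile("worker-\\d+\\.example\\.com"));
    System.out.println(allowRegistration(pattern, "worker-12.example.com"));
    System.out.println(allowRegistration(pattern, "rogue.example.org"));
    System.out.println(allowRegistration(Optional.empty(), "rogue.example.org"));
  }
}
```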

### Why are the changes needed?
Currently, the Celeborn master allows all workers to register. It is better to limit which workers are allowed to register.

### Does this PR introduce _any_ user-facing change?

No, the config item is optional, so there is no breaking change.

### How was this patch tested?
UT.

Closes #2660 from turboFei/hosts_patterns.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-08 22:43:19 +08:00
Wang, Fei
a599ff2afe [CELEBORN-1535] Support to disable master workerUnavailableInfo expiration
### What changes were proposed in this pull request?

This PR supports disabling the worker unavailable info expiration by setting the timeout to -1.
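
A sketch of how a `-1` timeout can switch expiration off. The names `UnavailableInfoExpirySketch` and `isExpired` are hypothetical, and treating every negative timeout like `-1` is an assumption of this example rather than something stated in the PR.

```java
final class UnavailableInfoExpirySketch {
  // Returns true when the record is older than the timeout.
  // Assumption: a negative timeout (such as -1) means "never expire".
  public static boolean isExpired(long recordedAtMs, long nowMs, long timeoutMs) {
    if (timeoutMs < 0) {
      return false;
    }
    return nowMs - recordedAtMs > timeoutMs;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long dayMs = 24L * 60 * 60 * 1000;
    System.out.println(isExpired(now - 2 * dayMs, now, dayMs)); // older than the timeout
    System.out.println(isExpired(now - 2 * dayMs, now, -1L));   // expiration disabled
  }
}
```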

### Why are the changes needed?

In our use case, we want to retain all the worker unavailable information.
This is acceptable if we use fixed ports and hosts, and it will not occupy much memory.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Not needed.

Closes #2657 from turboFei/disable_Cleanup.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-08-07 08:22:36 +08:00
Mridul Muralidharan
cc6bc8d18d [CELEBORN-1524] Support IPv6 hostnames for Apache Ratis
### What changes were proposed in this pull request?

Work around an Apache Ratis bug in Celeborn until a new Ratis release containing the fix is available for us to use.

### Why are the changes needed?

[RATIS-2131](https://issues.apache.org/jira/browse/RATIS-2131) has been fixed, and will be available in 3.2.0 - until it is released, this will work around the issue.

### Does this PR introduce _any_ user-facing change?

Fixes a bug

### How was this patch tested?

Manual testing in IPv6 env with hostnames for ratis config.

Closes #2646 from mridulm/workaround-RATIS-2131.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-26 16:43:09 +08:00
Sanskar Modi
afe40e6bb0 [CELEBORN-1519] Do not update estimated partition size if it is unchanged
### What changes were proposed in this pull request?

We will not update the estimated partition size if it is unchanged.

### Why are the changes needed?

Celeborn currently triggers a worker info update even though the estimated partition size has not changed. This leads to unnecessary logging and redundant worker info update operations.

Example log:

```
[master-partition-size-updater] WARN org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager - Celeborn cluster estimated partition size changed from 64.0 MiB to 64.0 MiB
```
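
The idea can be sketched as a simple equality check before updating. `PartitionSizeUpdateSketch` and `updateIfChanged` are hypothetical names for this example; the real logic lives in the master's meta manager and may differ.

```java
final class PartitionSizeUpdateSketch {
  private long estimatedPartitionSize;

  public PartitionSizeUpdateSketch(long initialSize) {
    this.estimatedPartitionSize = initialSize;
  }

  // Returns true only when the size actually changed and was updated;
  // an unchanged size is skipped, so no log line or worker info update follows.
  public boolean updateIfChanged(long newSize) {
    if (newSize == estimatedPartitionSize) {
      return false;
    }
    estimatedPartitionSize = newSize;
    return true;
  }

  public static void main(String[] args) {
    long mib = 1024L * 1024L;
    PartitionSizeUpdateSketch meta = new PartitionSizeUpdateSketch(64 * mib);
    System.out.println("updated: " + meta.updateIfChanged(64 * mib));
    System.out.println("updated: " + meta.updateIfChanged(128 * mib));
  }
}
```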

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?
Existing UTs

Closes #2642 from s0nskar/CELEBORN-1519.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-25 16:25:40 +08:00
SteNicholas
c0ca9523f9 [CELEBORN-1190][FOLLOWUP] Fix WARNING of error prone
### What changes were proposed in this pull request?

- Fix `WARNING` of error prone.
- Disable `EmptyCatch`, `JdkObsolete`, `MutableConstantField` and `UnnecessaryParentheses`.

### Why are the changes needed?

Error Prone generates many `WARNING`s. We should follow Error Prone's suggestions to fix them.
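
As one generic illustration of this kind of fix, a `NonAtomicVolatileUpdate` warning is commonly resolved by replacing a `volatile` counter with an atomic one. The class below is a hypothetical example and is not taken from the Celeborn code base or from this PR's diff.

```java
import java.util.concurrent.atomic.AtomicLong;

final class AtomicCounterSketch {
  // Before (flagged): private volatile long bytesWritten; bytesWritten += n;
  // The read-modify-write on a volatile field is not atomic.
  private final AtomicLong bytesWritten = new AtomicLong();

  // Atomically adds n and returns the new total.
  public long add(long n) {
    return bytesWritten.addAndGet(n);
  }

  public long get() {
    return bytesWritten.get();
  }

  public static void main(String[] args) {
    AtomicCounterSketch counter = new AtomicCounterSketch();
    counter.add(5);
    counter.add(7);
    System.out.println("total bytes: " + counter.get());
  }
}
```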

```
$ mvn clean install -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/network/sasl/SaslUtils.java:[44,25] [MutableConstantField] Constant field declarations should use the immutable type (such as ImmutableList) instead of the general collection interface type (such as List)
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/network/sasl/SaslUtils.java:[47,18] [MutableConstantField] Constant field declarations should use the immutable type (such as ImmutableList) instead of the general collection interface type (such as List)
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/network/client/TransportClientBootstrap.java:[34,5] [InvalidParam] Parameter name `channel` is unknown.
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/network/client/TransportResponseHandler.java:[96,29] [StaticAssignmentInConstructor] This assignment is to a static field. Mutating static state from a constructor is highly error-prone.
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/network/client/TransportResponseHandler.java:[104,30] [StaticAssignmentInConstructor] This assignment is to a static field. Mutating static state from a constructor is highly error-prone.
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/network/sasl/anonymous/AnonymousSaslServerFactory.java:[67,2] [ClassCanBeStatic] Inner class is non-static but does not reference enclosing class
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/meta/FileInfo.java:[60,17] [NonAtomicVolatileUpdate] This update of a volatile variable is non-atomic
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/network/util/TransportFrameDecoder.java:[54,46] [JdkObsolete] It is very rare for LinkedList to out-perform ArrayList or ArrayDeque. Avoid it unless you're willing to invest a lot of time into benchmarking. Caveat: LinkedList supports null elements, but ArrayDeque does not.
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/network/ssl/ReloadingX509TrustManager.java:[207,29] [NonAtomicVolatileUpdate] This update of a volatile variable is non-atomic
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/network/ssl/ReloadingX509TrustManager.java:[216,28] [NonAtomicVolatileUpdate] This update of a volatile variable is non-atomic
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/network/sasl/anonymous/AnonymousSaslClientFactory.java:[73,2] [ClassCanBeStatic] Inner class is non-static but does not reference enclosing class
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/network/sasl/anonymous/AnonymousSaslClientFactory.java:[93,31] [DefaultCharset] Implicit use of the platform default charset, which can result in differing behaviour between JVM executions or incorrect behavior if the encoding of the data source doesn't match expectations.
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/util/ExceptionUtils.java:[65,11] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/common/src/main/java/org/apache/celeborn/common/util/ExceptionUtils.java:[66,11] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/common/src/test/java/org/apache/celeborn/common/network/ssl/SslSampleConfigs.java:[164,16] [JavaUtilDate] Date has a bad API that leads to bugs; prefer java.time.Instant or LocalDate.
[WARNING] /Users/nicholas/Github/celeborn/common/src/test/java/org/apache/celeborn/common/network/ssl/SslSampleConfigs.java:[165,14] [JavaUtilDate] Date has a bad API that leads to bugs; prefer java.time.Instant or LocalDate.
[WARNING] /Users/nicholas/Github/celeborn/common/src/test/java/org/apache/celeborn/common/network/ssl/SslSampleConfigs.java:[165,35] [JavaUtilDate] Date has a bad API that leads to bugs; prefer java.time.Instant or LocalDate.
[WARNING] /Users/nicholas/Github/celeborn/common/src/test/java/org/apache/celeborn/common/network/SSLTransportClientFactorySuiteJ.java:[32,14] [MissingOverride] setUp overrides method in TransportClientFactorySuiteJ; expected Override
[WARNING] /Users/nicholas/Github/celeborn/common/src/test/java/org/apache/celeborn/common/network/SSLTransportClientFactorySuiteJ.java:[40,14] [MissingOverride] tearDown overrides method in TransportClientFactorySuiteJ; expected Override
[WARNING] /Users/nicholas/Github/celeborn/common/src/test/java/org/apache/celeborn/common/network/protocol/EncryptedMessageWithHeaderSuiteJ.java:[124,6] [UseCorrectAssertInTests] Java assert is used in test. For testing purposes Assert.* matchers should be used.
[WARNING] /Users/nicholas/Github/celeborn/common/src/test/java/org/apache/celeborn/common/network/RpcIntegrationSuiteJ.java:[255,15] [UnusedMethod] Private method 'assertErrorAndClosed' is never used.
[WARNING] /Users/nicholas/Github/celeborn/common/src/test/java/org/apache/celeborn/common/network/RpcIntegrationSuiteJ.java:[154,17] [UnusedNestedClass] This nested class is unused, and can be removed.
[WARNING] /Users/nicholas/Github/celeborn/common/src/test/java/org/apache/celeborn/common/network/RpcIntegrationSuiteJ.java:[57,15] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/common/src/test/java/org/apache/celeborn/common/network/ssl/ReloadingX509TrustManagerSuiteJ.java:[107,10] [AssertThrowsMultipleStatements] The lambda passed to assertThrows should contain exactly one statement
[WARNING] /Users/nicholas/Github/celeborn/common/src/test/java/org/apache/celeborn/common/network/ssl/ReloadingX509TrustManagerSuiteJ.java:[134,10] [AssertThrowsMultipleStatements] The lambda passed to assertThrows should contain exactly one statement
[WARNING] /Users/nicholas/Github/celeborn/client/src/main/java/org/apache/celeborn/client/read/LocalPartitionReader.java:[84,31] [StaticAssignmentInConstructor] This assignment is to a static field. Mutating static state from a constructor is highly error-prone.
[WARNING] /Users/nicholas/Github/celeborn/client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java:[130,6] [ThreadLocalUsage] ThreadLocals should be stored in static fields
[WARNING] /Users/nicholas/Github/celeborn/client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java:[714,6] [MissingCasesInEnumSwitch] Non-exhaustive switch; either add a default or handle the remaining cases: SUCCESS, PARTIAL_SUCCESS, REQUEST_FAILED, and 43 others
[WARNING] /Users/nicholas/Github/celeborn/client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java:[1609,10] [MissingCasesInEnumSwitch] Non-exhaustive switch; either add a default or handle the remaining cases: PARTIAL_SUCCESS, REQUEST_FAILED, SHUFFLE_ALREADY_REGISTERED, and 45 others
[WARNING] /Users/nicholas/Github/celeborn/client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java:[1648,26] [MissingOverride] updateFileGroup implements method in ShuffleClient; expected Override
[WARNING] /Users/nicholas/Github/celeborn/client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java:[1654,57] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/client/src/main/java/org/apache/celeborn/client/ShuffleClientImpl.java:[1823,32] [MissingOverride] getDataClientFactory implements method in ShuffleClient; expected Override
[WARNING] /Users/nicholas/Github/celeborn/client/src/test/java/org/apache/celeborn/client/ShuffleClientSuiteJ.java:[185,6] [UseCorrectAssertInTests] Java assert is used in test. For testing purposes Assert.* matchers should be used.
[WARNING] /Users/nicholas/Github/celeborn/service/src/main/java/org/apache/celeborn/server/common/service/store/db/DbServiceManagerImpl.java:[70,33] [JavaUtilDate] Date has a bad API that leads to bugs; prefer java.time.Instant or LocalDate.
[WARNING] /Users/nicholas/Github/celeborn/service/src/main/java/org/apache/celeborn/server/common/service/store/db/DbServiceManagerImpl.java:[71,33] [JavaUtilDate] Date has a bad API that leads to bugs; prefer java.time.Instant or LocalDate.
[WARNING] /Users/nicholas/Github/celeborn/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/ha/HARaftServer.java:[424,11] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/ha/HARaftServer.java:[425,11] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/ha/HARaftServer.java:[496,55] [UnescapedEntity] This looks like a type with type parameters. The < and > characters here will be interpreted as HTML, which can be avoided by wrapping it in a {code } tag.
[WARNING] /Users/nicholas/Github/celeborn/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/SingleMasterMetaManager.java:[166,14] [MissingOverride] handleUpdatePartitionSize implements method in IMetadataHandler; expected Override
[WARNING] /Users/nicholas/Github/celeborn/master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java:[298,61] [JdkObsolete] It is very rare for LinkedList to out-perform ArrayList or ArrayDeque. Avoid it unless you're willing to invest a lot of time into benchmarking. Caveat: LinkedList supports null elements, but ArrayDeque does not.
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/MapPartitionDataReader.java:[346,37] [NonAtomicVolatileUpdate] This update of a volatile variable is non-atomic
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java:[202,33] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java:[300,31] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java:[497,17] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java:[503,17] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/memory/MemoryManager.java:[513,39] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/CreditStreamManager.java:[256,12] [ClassCanBeStatic] Inner class is non-static but does not reference enclosing class
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/WorkerSecretRegistryImpl.java:[73,12] [CacheLoaderNull] The result of CacheLoader#load must be non-null.
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/ReducePartitionDataWriter.java:[69,13] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/ReducePartitionDataWriter.java:[73,13] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/ReducePartitionDataWriter.java:[103,24] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/ReducePartitionDataWriter.java:[104,39] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/MapPartitionDataWriter.java:[261,46] [ByteBufferBackingArray] ByteBuffer.array() shouldn't be called unless ByteBuffer.arrayOffset() is used or if the ByteBuffer was initialized using ByteBuffer.wrap() or ByteBuffer.allocate().
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/ChunkStreamManager.java:[102,40] [NonAtomicVolatileUpdate] This update of a volatile variable is non-atomic
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/ChunkStreamManager.java:[109,40] [NonAtomicVolatileUpdate] This update of a volatile variable is non-atomic
[WARNING] /Users/nicholas/Github/celeborn/worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/PartitionFilesSorter.java:[318,39] [IntLongMath] Expression of type int may overflow before being assigned to a long
[WARNING] /Users/nicholas/Github/celeborn/worker/src/test/java/org/apache/celeborn/service/deploy/worker/FetchHandlerSuiteJ.java:[133,6] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/worker/src/test/java/org/apache/celeborn/service/deploy/worker/network/SSLRequestTimeoutIntegrationSuiteJ.java:[32,14] [MissingOverride] setUp overrides method in RequestTimeoutIntegrationSuiteJ; expected Override
[WARNING] /Users/nicholas/Github/celeborn/worker/src/test/java/org/apache/celeborn/service/deploy/worker/network/SSLRequestTimeoutIntegrationSuiteJ.java:[40,14] [MissingOverride] tearDown overrides method in RequestTimeoutIntegrationSuiteJ; expected Override
[WARNING] /Users/nicholas/Github/celeborn/worker/src/test/java/org/apache/celeborn/service/deploy/worker/storage/ChunkFetchIntegrationSuiteJ.java:[74,15] [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
[WARNING] /Users/nicholas/Github/celeborn/worker/src/test/java/org/apache/celeborn/service/deploy/worker/storage/ChunkFetchIntegrationSuiteJ.java:[186,47] [JdkObsolete] It is very rare for LinkedList to out-perform ArrayList or ArrayDeque. Avoid it unless you're willing to invest a lot of time into benchmarking. Caveat: LinkedList supports null elements, but ArrayDeque does not.
[WARNING] /Users/nicholas/Github/celeborn/worker/src/test/java/org/apache/celeborn/service/deploy/worker/storage/SSLReducePartitionDataWriterSuiteJ.java:[30,26] [MissingOverride] createModuleTransportConf overrides method in DiskReducePartitionDataWriterSuiteJ; expected Override
[WARNING] /Users/nicholas/Github/celeborn/worker/src/test/java/org/apache/celeborn/service/deploy/worker/storage/local/DiskReducePartitionDataWriterSuiteJ.java:[234,47] [JdkObsolete] It is very rare for LinkedList to out-perform ArrayList or ArrayDeque. Avoid it unless you're willing to invest a lot of time into benchmarking. Caveat: LinkedList supports null elements, but ArrayDeque does not.
[WARNING] /Users/nicholas/Github/celeborn/worker/src/test/java/org/apache/celeborn/service/deploy/worker/storage/memory/MemoryReducePartitionDataWriterSuiteJ.java:[198,47] [JdkObsolete] It is very rare for LinkedList to out-perform ArrayList or ArrayDeque. Avoid it unless you're willing to invest a lot of time into benchmarking. Caveat: LinkedList supports null elements, but ArrayDeque does not.
```
```
$ mvn clean install -Pspark-2.4 -pl client-spark/common,client-spark/spark-2 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
[WARNING] /Users/nicholas/Github/celeborn/client-spark/common/src/main/java/org/apache/spark/shuffle/celeborn/SortBasedPusher.java:[109,57] [JdkObsolete] It is very rare for LinkedList to out-perform ArrayList or ArrayDeque. Avoid it unless you're willing to invest a lot of time into benchmarking. Caveat: LinkedList supports null elements, but ArrayDeque does not.
[WARNING] /Users/nicholas/Github/celeborn/client-spark/common/src/main/java/org/apache/spark/shuffle/celeborn/SendBufferPool.java:[56,14] [JdkObsolete] It is very rare for LinkedList to out-perform ArrayList or ArrayDeque. Avoid it unless you're willing to invest a lot of time into benchmarking. Caveat: LinkedList supports null elements, but ArrayDeque does not.
[WARNING] /Users/nicholas/Github/celeborn/client-spark/common/src/main/java/org/apache/spark/shuffle/celeborn/SendBufferPool.java:[57,21] [JdkObsolete] It is very rare for LinkedList to out-perform ArrayList or ArrayDeque. Avoid it unless you're willing to invest a lot of time into benchmarking. Caveat: LinkedList supports null elements, but ArrayDeque does not.
[WARNING] /Users/nicholas/Github/celeborn/client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkShuffleManager.java:[247,14] [UnusedMethod] Private method 'executorCores' is never used.
[WARNING] /Users/nicholas/Github/celeborn/client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkShuffleManager.java:[120,55] [ReferenceEquality] Comparison using reference equality instead of value equality
```
```
$ mvn clean install -Pspark-3.5 -pl client-spark/spark-3 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
[WARNING] /Users/nicholas/Github/celeborn/client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/CelebornShuffleDataIO.java:[65,17] [MissingOverride] supportsReliableStorage implements method in ShuffleDriverComponents; expected Override
[WARNING] /Users/nicholas/Github/celeborn/client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkShuffleManager.java:[163,55] [ReferenceEquality] Comparison using reference equality instead of value equality
```
```
$ mvn clean install -Pflink-1.14 -pl client-flink/common,client-flink/flink-1.14 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/readclient/CelebornBufferStream.java:[161,17] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/readclient/CelebornBufferStream.java:[223,27] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/ShuffleTaskInfo.java:[46,17] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleInputGateDelegation.java:[99,66] [JdkObsolete] It is very rare for LinkedList to out-perform ArrayList or ArrayDeque. Avoid it unless you're willing to invest a lot of time into benchmarking. Caveat: LinkedList supports null elements, but ArrayDeque does not.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleInputGateDelegation.java:[236,21] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleInputGateDelegation.java:[251,19] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleInputGateDelegation.java:[267,17] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleInputGateDelegation.java:[354,17] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleInputGateDelegation.java:[392,17] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleInputGateDelegation.java:[473,17] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/RemoteShuffleInputGateDelegation.java:[533,17] [SynchronizeOnNonFinalField] Synchronizing on non-final fields is not safe: if the field is ever updated, different threads may end up locking on different objects.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/buffer/TransferBufferPool.java:[182,33] [MixedMutabilityReturnType] This method returns both mutable and immutable collections or maps from different paths. This may be confusing for users of the method.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/main/java/org/apache/celeborn/plugin/flink/utils/FlinkUtils.java:[34,6] [DoubleBraceInitialization] Prefer collection factory methods or builders to the double-brace initialization pattern.
[WARNING] /Users/nicholas/Github/celeborn/client-flink/common/src/test/java/org/apache/celeborn/plugin/flink/BufferPackSuiteJ.java:[207,6] [CatchAndPrintStackTrace] Logging or rethrowing exceptions should usually be preferred to catching and calling printStackTrace
[WARNING] /Users/nicholas/Github/celeborn/client-flink/flink-1.14/src/test/java/org/apache/celeborn/plugin/flink/RemoteShuffleResultPartitionSuiteJ.java:[140,67] [CanonicalDuration] Duration can be expressed more clearly with different units
```
```
$ mvn clean install -Pflink-1.15 -pl client-flink/flink-1.15 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
[WARNING] /Users/nicholas/Github/celeborn/client-flink/flink-1.15/src/test/java/org/apache/celeborn/plugin/flink/RemoteShuffleResultPartitionSuiteJ.java:[140,67] [CanonicalDuration] Duration can be expressed more clearly with different units
```
```
$ mvn clean install -Pflink-1.16 -pl client-flink/flink-1.16 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
```
```
$ mvn clean install -Pflink-1.17 -pl client-flink/flink-1.17 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
[WARNING] /Users/nicholas/Github/celeborn/client-flink/flink-1.17/src/test/java/org/apache/celeborn/plugin/flink/RemoteShuffleResultPartitionSuiteJ.java:[140,67] [CanonicalDuration] Duration can be expressed more clearly with different units
```
```
$ mvn clean install -Pflink-1.18 -pl client-flink/flink-1.18 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
[WARNING] /Users/nicholas/Github/celeborn/client-flink/flink-1.18/src/test/java/org/apache/celeborn/plugin/flink/RemoteShuffleResultPartitionSuiteJ.java:[140,67] [CanonicalDuration] Duration can be expressed more clearly with different units
```
```
$ mvn clean install -Pflink-1.19 -pl client-flink/flink-1.19 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
[WARNING] /Users/nicholas/Github/celeborn/client-flink/flink-1.19/src/test/java/org/apache/celeborn/plugin/flink/RemoteShuffleResultPartitionSuiteJ.java:[140,67] [CanonicalDuration] Duration can be expressed more clearly with different units
```
```
$ mvn clean install -Pmr -pl client-mr/mr -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

```
$ mvn clean install -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
$ mvn clean install -Pspark-2.4 -pl client-spark/common,client-spark/spark-2 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
$ mvn clean install -Pspark-3.5 -pl client-spark/spark-3 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
$ mvn clean install -Pflink-1.14 -pl client-flink/common,client-flink/flink-1.14 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
$ mvn clean install -Pflink-1.15 -pl client-flink/flink-1.15 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
$ mvn clean install -Pflink-1.16 -pl client-flink/flink-1.16 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
$ mvn clean install -Pflink-1.17 -pl client-flink/flink-1.17 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
$ mvn clean install -Pflink-1.18 -pl client-flink/flink-1.18 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
$ mvn clean install -Pflink-1.19 -pl client-flink/flink-1.19 -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
$ mvn clean install -Pmr -pl client-mr/mr -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Dspotless.check.skip=true|grep WARNING|grep java
```

Closes #2555 from SteNicholas/CELEBORN-1190.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2024-07-25 01:21:15 -05:00
Sanskar Modi
0a68ae049c
[CELEBORN-1520] Minor logging fix for AppDiskUsageMetric and Fixed UTs
### What changes were proposed in this pull request?

- Minor logging changes in AppDiskUsageMetric that are not critical but slightly bothersome.
- Fixed UTs for AppDiskUsageMetric.

### Why are the changes needed?

1. The current AppDiskUsageMetric UTs were placeholders that only printed the output. They did not test or verify anything.
2. Minor logging changes in AppDiskUsageMetric.
- Comma formatting was wrong.

```
Snapshot start 2024-07-24T08:47:12.496 end 2024-07-24T08:57:12.497 Application application_1717149813731_19042841_2 used approximate 15.9 GiB ,Application application_1717149813731_19042841_1 used approximate 13.9 GiB
```

- We were printing an extra empty line after each summary.

```
11:20:24.339 [master-app-disk-usage-metrics-logger] INFO  org.apache.celeborn.common.meta.AppDiskUsageMetric - App Disk Usage Top50 Report
Snapshot start 2024-07-24T09:17:12.498 end 2024-07-24T09:27:12.498 Application application_XXX used approximate 14.5 GiB
Snapshot start 2024-07-24T08:17:12.495 end 2024-07-24T08:27:12.496 Application application_XXX used approximate 15.9 GiB

11:27:12.507 [master-app-disk-usage-metrics-logger] INFO  org.apache.celeborn.common.meta.AppDiskUsageMetric - App Disk Usage Top50 Report
Snapshot start 2024-07-24T09:17:12.498 end 2024-07-24T09:27:12.498 Application application_XXX used approximate 14.5 GiB
Snapshot start 2024-07-24T08:17:12.495 end 2024-07-24T08:27:12.496 Application application_XXX used approximate 15.9 GiB
```
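
Both logging issues come down to how separators are placed. A minimal sketch, assuming each snapshot is rendered as one line of per-application entries (the class and method names are hypothetical and are not taken from `AppDiskUsageMetric`): joining with a fixed separator puts the comma directly after each entry and leaves no trailing empty line.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class UsageReportSketch {
    private UsageReportSketch() {}

    // One snapshot line: entries separated by ", " so no space precedes the comma.
    static String snapshotLine(List<String> entries) {
        return String.join(", ", entries);
    }

    // Whole report: one line per snapshot, joined with '\n' so nothing trails the last line.
    static String report(List<List<String>> snapshots) {
        return snapshots.stream()
            .map(UsageReportSketch::snapshotLine)
            .collect(Collectors.joining("\n"));
    }
}
```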

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Fixed current UTs and verified from the logs.

Closes #2643 from s0nskar/app_disk_usage.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-25 13:47:12 +08:00
Wang, Fei
e92d59ace5 [CELEBORN-1522] Fix applicationId extraction from shuffle key
### What changes were proposed in this pull request?

Fix applicationId extraction from shuffle key.

### Why are the changes needed?

For spark on k8s, the applicationId might be `spark-da4571bd2cbf491c892cbd4de40fc918`.

Because the applicationId extraction is incorrect, the result of the API `/api/v1/applications/top_disk_usages` is incorrect.
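
A minimal sketch of how such an extraction can go wrong, assuming the shuffle key has the form `<applicationId>-<shuffleId>` (an assumption about the key layout; the helper names are hypothetical and this is not the project's actual method). Splitting at the first `-` truncates an id that itself contains a dash, while splitting at the last `-` keeps it whole.

```java
// Hypothetical helpers, for illustration only.
final class ShuffleKeySketch {
    private ShuffleKeySketch() {}

    // Naive: everything before the first '-'. Breaks when the applicationId contains '-'.
    static String appIdByFirstDash(String shuffleKey) {
        return shuffleKey.substring(0, shuffleKey.indexOf('-'));
    }

    // Safer under the assumed layout: the shuffleId is the last segment, so split at the last '-'.
    static String appIdByLastDash(String shuffleKey) {
        return shuffleKey.substring(0, shuffleKey.lastIndexOf('-'));
    }

    public static void main(String[] args) {
        String key = "spark-da4571bd2cbf491c892cbd4de40fc918-0";
        System.out.println(appIdByFirstDash(key));
        System.out.println(appIdByLastDash(key));
    }
}
```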

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Not needed; this change just leverages an existing method.

Closes #2645 from turboFei/fix_topdisk_usages.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2024-07-24 21:50:09 -07:00
zhaohehuhu
7a596bbed1 [CELEBORN-1469] Support writing shuffle data to OSS(S3 only)
### What changes were proposed in this pull request?

as title

### Why are the changes needed?

Now, Celeborn doesn't support sinking shuffle data directly to Amazon S3, which could be a limitation when we're trying to move on-premises servers to AWS and use S3 as a data sink for shuffled data.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Closes #2579 from zhaohehuhu/dev-0619.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-07-24 11:59:15 +08:00
Wang, Fei
8b7c2b3f12 [CELEBORN-1477][FOLLOWUP] Fix api v1 response issue
### What changes were proposed in this pull request?

1. Fix the responses of the following APIs:

- master GET /api/v1/masters
- master GET /api/v1/applications/top_disk_usages
- master&worker /api/v1/thread_dump

2. Fix a typo in the migration guide

3. Refine the API annotation: METHOD -> PATH

4. Enhance the `RestExceptionMapper`
### Why are the changes needed?

For `/api/v1/masters`, the `id` field is not well formatted:
```
{
"groupId": "c5196f6d-2c34-3ed3-8b8a-47bede733167",
"leader": {
"id": "<ByteString4960c29e size=1 contents=\"0\">",
"address": "...:9872"
},
...
}
```

For `/api/v1/applications/top_disk_usages`, an NPE was thrown; we should filter out the null items.
```
24/07/18 21:52:38,506 WARN [master-JettyThreadPool-40] RestExceptionMapper: Error occurs on accessing REST API.
java.lang.NullPointerException
	at org.apache.celeborn.service.deploy.master.http.api.v1.ApplicationResource.$anonfun$topDiskUsedApplications$2(ApplicationResource.scala:78)
```
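
A minimal sketch of filtering null items before mapping, assuming the usage snapshots live in an array with unfilled (null) slots. The `Snapshot` type and method names are hypothetical stand-ins, not the classes used by `ApplicationResource`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical stand-ins, for illustration only.
final class NullFilterSketch {
    private NullFilterSketch() {}

    static final class Snapshot {
        final String appId;
        final long usedBytes;

        Snapshot(String appId, long usedBytes) {
            this.appId = appId;
            this.usedBytes = usedBytes;
        }
    }

    // Drop null slots first; dereferencing them in map() would throw a NullPointerException.
    static List<String> appIds(Snapshot[] slots) {
        return Arrays.stream(slots)
            .filter(Objects::nonNull)
            .map(s -> s.appId)
            .collect(Collectors.toList());
    }
}
```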

For `api/v1/thread_dump`, it seems we need to add `Produces(Array(MediaType.APPLICATION_JSON))`:
```
Caused by: javax.ws.rs.InternalServerErrorException: HTTP 500 Internal Server Error
	at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:65)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:139)
	at org.glassfish.jersey.message.internal.MessageBodyFactory.writeTo(MessageBodyFactory.java:1116)
	at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:649)
	at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:380)
	at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:426)
	at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:264)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
	at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
	... 36 more
Caused by: org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyWriter not found for media type=text/html, type=class scala.collection.immutable.Map$Map1, genericType=class scala.collection.immutable.Map$Map1.
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.aroundWriteTo(WriterInterceptorExecutor.java:224)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:139)
	at org.glassfish.jersey.server.internal.JsonWithPaddingInterceptor.aroundWriteTo(JsonWithPaddingInterceptor.java:85)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:139)
	at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:61)
	... 51 more
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Integration testing.

For `api/v1/masters`:
<img width="824" alt="image" src="https://github.com/user-attachments/assets/c0908d05-aebc-435a-8446-038dd18fb7cd">

For master `api/v1/applications/top_disk_usages`:
<img width="559" alt="image" src="https://github.com/user-attachments/assets/50860735-9975-449a-9f77-24d8eafd2018">

For `api/v1/thread_dump`:
<img width="1188" alt="image" src="https://github.com/user-attachments/assets/9844de22-45c6-46ba-9260-c8a7d28c2e1d">

Closes #2637 from turboFei/fix_id_info.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2024-07-22 19:02:36 -07:00
Fei Wang
bd3f8236d0 [CELEBORN-1317][FOLLOWUP] Fix media type annotations for form urlencoded APIs
### What changes were proposed in this pull request?

This PR is a follow-up to https://github.com/apache/celeborn/pull/2495 that fixes the media types.

### Why are the changes needed?

The media types shown in the swagger UI are not correct.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Before:
<img width="1439" alt="image" src="https://github.com/apache/celeborn/assets/6757692/f287c02b-791c-4677-93b7-ac9c5e4ee34f">
After:
<img width="1341" alt="image" src="https://github.com/apache/celeborn/assets/6757692/13e5d310-7c97-4872-9496-f9b12113b7ab">

Closes #2616 from turboFei/form_app.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-11 23:51:50 +08:00
SteNicholas
adbef7b441 [CELEBORN-1499] Bump Ratis version from 3.0.1 to 3.1.0
### What changes were proposed in this pull request?

Bump Ratis version from 3.0.1 to 3.1.0. Meanwhile, remove `CelebornStateMachineStorage` with the release of https://github.com/apache/ratis/pull/1111.

### Why are the changes needed?

Bump Ratis version from 3.0.1 to 3.1.0. Ratis has released v3.1.0; see the [3.1.0](https://ratis.apache.org/post/3.1.0.html) release notes. Version 3.1.0 is a minor release with multiple improvements and bugfixes, including [[RATIS-2111] Reinitialize should load the latest snapshot](https://issues.apache.org/jira/browse/RATIS-2111). See the [changes between 3.0.1 and 3.1.0](https://github.com/apache/ratis/compare/ratis-3.0.1...ratis-3.1.0) releases.

Follow up #2547.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`MasterStateMachineSuiteJ#testInstallSnapshot`

Closes #2610 from SteNicholas/CELEBORN-1499.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-07-11 16:29:58 +08:00
Fei Wang
d698a69edc
[CELEBORN-1477][CIP-9] Refine the celeborn RESTful APIs
### What changes were proposed in this pull request?

This PR is for [CIP-9 Refine the celeborn RESTful APIs](https://docs.google.com/document/d/1LV2vV-w3XtlbJj2Vi4J77mt4IYCr40-8A_JncZLsHqs/edit?usp=sharing).

We leverage [openapi-generator](https://github.com/OpenAPITools/openapi-generator) to generate the client and model code.

### Why are the changes needed?

Celeborn has implemented RESTful APIs for monitoring and administrative operations on both master and worker endpoints. These APIs enable tasks such as configuration checks, status viewing of master/worker nodes, worker decommissioning/recommissioning, and more. They provide crucial insights and support for DevOps.
The primary concern with the existing API is the response content type, which is `text/plain` rather than the more widely accepted `application/json`. This mismatch makes integration with DevOps tools challenging, as these tools typically require JSON-formatted responses for seamless parsing and automation.
I also saw the need for REST API evolution in the [Apache Celeborn CLI Proposal](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-7+Celeborn+CLI).

### Does this PR introduce _any_ user-facing change?
This PR introduces a new API namespace: `/api/v1`. This approach allows us to maintain the current API for compatibility while offering an improved version.

### How was this patch tested?
UT.

Closes #2599 from turboFei/cip_9_openapi.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-07-11 10:57:00 +08:00
Fei Wang
5dfbfd2840 [CELEBORN-1475] Fix unknownExcludedWorkers filter for /exclude request
### What changes were proposed in this pull request?

Currently, the filter for unknown excluded workers is:
```
    val unknownExcludedWorkers =
      (workersToAdd ++ workersToRemove).filter(!statusSystem.workers.contains(_))
```

`workersToAdd` and `workersToRemove` are of type `Array[String]`, while `statusSystem.workers` is of type `Set<WorkerInfo>`, so the `contains` check compares a `String` against `WorkerInfo` elements.

In this PR, `workersToAdd` and `workersToRemove` are of type `List[WorkerInfo]`.
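
A Java analogue of the mismatch, as a sketch only: the `WorkerInfo` below is a simplified hypothetical stand-in identified by host and port, not Celeborn's class. Because `Set.contains` accepts any `Object`, passing a `String` compiles but can never match a `WorkerInfo` element, so every requested worker is reported as unknown.

```java
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical simplified types, for illustration only.
final class UnknownWorkerFilterSketch {
    private UnknownWorkerFilterSketch() {}

    static final class WorkerInfo {
        final String host;
        final int port;

        WorkerInfo(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof WorkerInfo)) return false;
            WorkerInfo other = (WorkerInfo) o;
            return port == other.port && host.equals(other.host);
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port);
        }
    }

    // Mismatched types: a String is never equal to a WorkerInfo, so nothing is filtered out.
    static List<String> unknownByString(Set<WorkerInfo> known, List<String> requested) {
        return requested.stream().filter(w -> !known.contains(w)).collect(Collectors.toList());
    }

    // Matching types: only workers that are really absent from the set remain.
    static List<WorkerInfo> unknownByWorkerInfo(Set<WorkerInfo> known, List<WorkerInfo> requested) {
        return requested.stream().filter(w -> !known.contains(w)).collect(Collectors.toList());
    }
}
```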

### Why are the changes needed?

As title.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Closes #2586 from turboFei/fix_work_filter.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-06-23 03:43:18 +08:00
Fei Wang
5cea9cc7f2
[CELEBORN-1318] Support celeborn http authentication
### What changes were proposed in this pull request?
Support celeborn master/worker http authentication.

### Why are the changes needed?
Authentication is needed for celeborn admin APIs.

### Does this PR introduce _any_ user-facing change?
Yes, it introduces authentication-related config items, but does not break the current behavior.

### How was this patch tested?

Added UT for BASIC and Bearer authentication.

Closes #2440 from turboFei/http_auth.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-06-20 10:35:12 +08:00
Xianming Lei
cb30e911e5 [CELEBORN-1452] Master follower node metadata is out of sync after installing snapshot
### What changes were proposed in this pull request?
Fix master follower node metadata being out of sync after installing a snapshot.

### Why are the changes needed?
When follower node metadata is out of sync and a master-slave switchover occurs, the stability of the cluster is at serious risk.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

Closes #2547 from leixm/issue_1452.

Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-06-13 17:09:12 +08:00
Erik.fang
5323c1d009 [CELEBORN-1337] Remove unused fields from HeartbeatFromApplicationResponse
As discussed in https://github.com/apache/celeborn/pull/2398, this PR removes unused fields from `HeartbeatFromApplicationResponse`, without adding a WorkerId type.

Closes #2529 from ErikFang/remove-unused-fields-HeartbeatFromApplicationResponse.

Authored-by: Erik.fang <fmerik@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-06-13 10:15:00 +08:00
Xianming Lei
999510b265 [CELEBORN-1444] Introduce worker decommission metrics and corresponding REST API
### What changes were proposed in this pull request?

Introduce worker decommission metrics and corresponding REST API.

### Why are the changes needed?

In a production environment, due to certain hardware or environmental reasons, our script will automatically decommission the node. At this time, we need to distinguish between graceful shutdown nodes and decommissioned nodes.

If we distinguish shutdown worker and decommission worker metrics, we can achieve better operation and maintenance.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

- `DefaultMetaSystemSuiteJ#testHandleReportWorkerDecommission`
- `RatisMasterStatusSystemSuiteJ#testHandleReportWorkerDecommission`
- `ApiMasterResourceSuite#decommissionWorkers`
- `ApiWorkerResourceSuite#isDecommissioning`

Closes #2535 from leixm/issue_1444.

Lead-authored-by: Xianming Lei <jerrylei@apache.org>
Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-06-08 11:10:31 +08:00
SteNicholas
2a57fab869 [CELEBORN-1400] Bump Ratis version from 2.5.1 to 3.0.1
### What changes were proposed in this pull request?

Bump Ratis version from 2.5.1 to 3.0.1. Address incompatible changes:

- RATIS-589. Eliminate buffer copying in SegmentedRaftLogOutputStream.(https://github.com/apache/ratis/pull/964)
- RATIS-1677. Do not auto format RaftStorage in RECOVER.(https://github.com/apache/ratis/pull/718)
- RATIS-1710. Refactor metrics api and implementation to separated modules. (https://github.com/apache/ratis/pull/749)

### Why are the changes needed?

Bump Ratis version from 2.5.1 to 3.0.1. Ratis has released v3.0.0 and v3.0.1; see the release notes for [3.0.0](https://ratis.apache.org/post/3.0.0.html) and [3.0.1](https://ratis.apache.org/post/3.0.1.html). The 3.0.x versions include new features such as pluggable metrics and lease read, plus improvements and bugfixes including:

- 3.0.0: Change list of Ratis 3.0.0. In total, there are roughly 100 commits since 2.5.1, including:
   - Incompatible Changes
      - RaftStorage Auto-Format
      - RATIS-1677. Do not auto format RaftStorage in RECOVER. (https://github.com/apache/ratis/pull/718)
      - RATIS-1694. Fix the compatibility issue of RATIS-1677. (https://github.com/apache/ratis/pull/731)
      - RATIS-1871. Auto format RaftStorage when there is only one directory configured. (https://github.com/apache/ratis/pull/903)
      - Pluggable Ratis-Metrics (RATIS-1688)
      - RATIS-1689. Remove the use of the thirdparty Gauge. (https://github.com/apache/ratis/pull/728)
      - RATIS-1692. Remove the use of the thirdparty Counter. (https://github.com/apache/ratis/pull/732)
      - RATIS-1693. Remove the use of the thirdparty Timer. (https://github.com/apache/ratis/pull/734)
      - RATIS-1703. Move MetricsReporting and JvmMetrics to impl. (https://github.com/apache/ratis/pull/741)
      - RATIS-1704. Fix SuppressWarnings(“VisibilityModifier”) in RatisMetrics. (https://github.com/apache/ratis/pull/742)
      - RATIS-1710. Refactor metrics api and implementation to separated modules. (https://github.com/apache/ratis/pull/749)
      - RATIS-1712. Add a dropwizard 3 implementation of ratis-metrics-api. (https://github.com/apache/ratis/pull/751)
      - RATIS-1391. Update library dropwizard.metrics version to 4.x (https://github.com/apache/ratis/pull/632)
      - RATIS-1601. Use the shaded dropwizard metrics and remove the dependency (https://github.com/apache/ratis/pull/671)
      - Streaming Protocol Change
      - RATIS-1569. Move the asyncRpcApi.sendForward(..) call to the client side. (https://github.com/apache/ratis/pull/635)
   - New Features
      - Leader Lease (RATIS-1864)
      - RATIS-1865. Add leader lease bound ratio configuration (https://github.com/apache/ratis/pull/897)
      - RATIS-1866. Maintain leader lease after AppendEntries (https://github.com/apache/ratis/pull/898)
      - RATIS-1894. Implement ReadOnly based on leader lease (https://github.com/apache/ratis/pull/925)
      - RATIS-1882. Support read-after-write consistency (https://github.com/apache/ratis/pull/913)
      - StateMachine API
      - RATIS-1874. Add notifyLeaderReady function in IStateMachine (https://github.com/apache/ratis/pull/906)
      - RATIS-1897. Make TransactionContext available in DataApi.write(..). (https://github.com/apache/ratis/pull/930)
      - New Configuration Properties
      - RATIS-1862. Add the parameter whether to take Snapshot when stopping to adapt to different services (https://github.com/apache/ratis/pull/896)
      - RATIS-1930. Add a conf for enable/disable majority-add. (https://github.com/apache/ratis/pull/961)
      - RATIS-1918. Introduces parameters that separately control the shutdown of RaftServerProxy by JVMPauseMonitor. (https://github.com/apache/ratis/pull/950)
      - RATIS-1636. Support re-config ratis properties (https://github.com/apache/ratis/pull/800)
      - RATIS-1860. Add ratis-shell cmd to generate a new raft-meta.conf. (https://github.com/apache/ratis/pull/901)
   - Improvements & Bug Fixes
      - Netty
         - RATIS-1898. Netty should use EpollEventLoopGroup by default (https://github.com/apache/ratis/pull/931)
         - RATIS-1899. Use EpollEventLoopGroup for Netty Proxies (https://github.com/apache/ratis/pull/932)
         - RATIS-1921. Shared worker group in WorkerGroupGetter should be closed. (https://github.com/apache/ratis/pull/955)
         - RATIS-1923. Netty: atomic operations require side-effect-free functions. (https://github.com/apache/ratis/pull/956)
      - RaftServer
         - RATIS-1924. Increase the default of raft.server.log.segment.size.max. (https://github.com/apache/ratis/pull/957)
         - RATIS-1892. Unify the lifetime of the RaftServerProxy thread pool (https://github.com/apache/ratis/pull/923)
         - RATIS-1889. NoSuchMethodError: RaftServerMetricsImpl.addNumPendingRequestsGauge (https://github.com/apache/ratis/pull/922)
         - RATIS-761. Handle writeStateMachineData failure in leader. (https://github.com/apache/ratis/pull/927)
         - RATIS-1902. The snapshot index is set incorrectly in InstallSnapshotReplyProto. (https://github.com/apache/ratis/pull/933)
         - RATIS-1912. Fix infinity election when perform membership change. (https://github.com/apache/ratis/pull/954)
         - RATIS-1858. Follower keeps logging first election timeout. (https://github.com/apache/ratis/pull/894)

- 3.0.1: This is a bugfix release. See the [changes between 3.0.0 and 3.0.1](https://github.com/apache/ratis/compare/ratis-3.0.0...ratis-3.0.1) releases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster manual test.

Closes #2480 from SteNicholas/CELEBORN-1400.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-30 17:22:22 +08:00
Fei Wang
493e0f10cf [CELEBORN-1317][FOLLOWUP] Fix threadDump UT stuck issue
### What changes were proposed in this pull request?

Try to fix the issue of the `ApiWorkerResourceSuite::threadDump` UT getting stuck.
1. Get the thread dump programmatically.

Related code is copied from apache/spark:
https://github.com/apache/spark/blob/v3.5.1/core/src/main/scala/org/apache/spark/util/Utils.scala
https://github.com/apache/spark/blob/v3.5.1/core/src/main/scala/org/apache/spark/status/api/v1/api.scala

### Why are the changes needed?
I found that the UT for the threadDump API sometimes gets stuck.
For example: https://github.com/apache/celeborn/actions/runs/8462056188/job/23182806487?pr=2428
<img width="1291" alt="image" src="https://github.com/apache/celeborn/assets/6757692/f39d7bb9-6e31-4ce3-a573-1ff86f335318">

<img width="762" alt="image" src="https://github.com/apache/celeborn/assets/6757692/437592dd-fc9c-404d-a452-834fcf630bd1">

The threadDump API UT was newly introduced in [CELEBORN-1317](https://issues.apache.org/jira/browse/CELEBORN-1317).

Previously there was no UT covering this API, and now the UT sometimes gets stuck.

Before this change, getThreadDump used a ProcessBuilder to obtain the thread info.

I suspect the spawned process gets stuck for some unknown reason, so this PR obtains the thread info programmatically instead.
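
As a rough illustration of collecting a thread dump in-process rather than by spawning an external process, here is a minimal sketch built on the JDK's `ThreadMXBean`. The names `ThreadDumpSketch` and `dump` are hypothetical; this is not the code copied from Spark or the code merged in this PR.

```scala
import java.lang.management.ManagementFactory

// Hypothetical helper for illustration only; not Celeborn's actual implementation.
object ThreadDumpSketch {
  // Returns one formatted entry per live thread, sorted by thread id.
  def dump(): Seq[String] = {
    // dumpAllThreads(lockedMonitors, lockedSynchronizers) is a standard JDK call.
    val infos = ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)
    infos.toSeq
      .filter(_ != null) // defensive: skip any missing entries
      .sortBy(_.getThreadId)
      .map { info =>
        val stack = info.getStackTrace.map(frame => s"    at $frame").mkString("\n")
        s"${info.getThreadName} id=${info.getThreadId} state=${info.getThreadState}\n$stack"
      }
  }
}
```

Because everything runs inside the JVM, there is no child process whose output has to be drained or that can hang independently of the test.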

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

UT.

![image](https://github.com/apache/celeborn/assets/6757692/51aaa44e-0523-4b60-b6c8-f4e83c709497)

Closes #2429 from turboFei/thread_dump.

Lead-authored-by: Fei Wang <fwang12@ebay.com>
Co-authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-27 15:12:50 +08:00
Shuang
308eed28c9 [CELEBORN-1427] Add Capacity metrics for Celeborn
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
The Celeborn cluster does not currently provide metrics for 'TotalCapacity' and 'TotalFreeCapacity'.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

Closes #2521 from RexXiong/CELEBORN-1427.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-23 16:06:11 +08:00
Mridul Muralidharan
a13d167617 [CELEBORN-1401] Add SSL support for ratis communication
### What changes were proposed in this pull request?

When SSL is enabled for master, secure the Ratis communication as well with TLS

### Why are the changes needed?

Currently, when TLS is enabled for RPC, Ratis communication still goes in the clear; this PR adds TLS support for it.
Note that currently this only supports GRPC, and not netty.

### Does this PR introduce _any_ user-facing change?
Secures ratis communication when TLS is enabled at master for rpc.

### How was this patch tested?
Local tests and additional unit tests added

Closes #2515 from mridulm/CELEBORN-1401-add-ratis-ssl-support.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-17 17:08:11 +08:00
Shuang
8a10a2d465 [CELEBORN-1421] Refine code in master to reduce unnecessary sync to get workers/lostworkers/shutdownWorkers
### What changes were proposed in this pull request?

1. Use ConcurrentSet to replace ArrayList for workers.
2. Remove unnecessary sync and snapshot when getting workers/lostworkers/shutdownWorkers.

### Why are the changes needed?

1. Reduce unnecessary sync to get workers/lostworkers/shutdownWorkers.
2. In some places the Master uses statusSystem.workers (an ArrayList) directly, which is not safe and can lead to concurrent modification issues.
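
The concurrent modification issue can be sketched as follows. This is a single-threaded illustration using only JDK collections; `WorkerSetSketch` and the worker names are hypothetical, and `ConcurrentHashMap.newKeySet` merely stands in for whichever concurrent set the PR actually uses.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.{ArrayList, ConcurrentModificationException}

// Hypothetical illustration; not the Master's real data structures.
object WorkerSetSketch {
  // Mutating an ArrayList while iterating it fails fast.
  def listFailsOnMutation(): Boolean = {
    val workers = new ArrayList[String]()
    workers.add("worker-1"); workers.add("worker-2"); workers.add("worker-3")
    try {
      val it = workers.iterator()
      while (it.hasNext) {
        it.next()
        workers.add("worker-late") // structural change during iteration
      }
      false
    } catch {
      case _: ConcurrentModificationException => true
    }
  }

  // A concurrent set has weakly consistent iterators, so the same pattern is safe.
  def setSizeAfterMutation(): Int = {
    val workers = ConcurrentHashMap.newKeySet[String]()
    workers.add("worker-1"); workers.add("worker-2"); workers.add("worker-3")
    val it = workers.iterator()
    while (it.hasNext) {
      it.next()
      workers.add("worker-late") // adding the same element again is a no-op
    }
    workers.size()
  }
}
```

With a concurrent set, readers can iterate without taking a lock or copying a snapshot first, which is the kind of sync this change removes.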

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA

Closes #2507 from RexXiong/CELEBORN-1421.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-17 14:06:37 +08:00
mingji
8dd33ceef8 [CELEBORN-1270] Introduce PbPackedPartitionLocations to (de-)serialize PartitionLocations more efficiently
### What changes were proposed in this pull request?
1. Introduces new approaches to (de-)serialize partition locations.
2. The Celeborn server remains compatible with old clients.

### Why are the changes needed?
1. Improve memory efficiency for partition locations.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
1. Pass GA.
2. Run tests on cluster:
```
val start = System.currentTimeMillis
spark.sparkContext.parallelize(1 to 10000, 10000).flatMap( _ => (1 to 950000).iterator.map(num => num)).repartition(10000).count
val after = System.currentTimeMillis
println((after-start)/1000)
```
packed RPC time: 70,65,64,64,64,64
baseline RPC time: 69,66,66,66,67,66

I think this PR does not introduce performance overhead.

3. RPC size test: this PR can reduce RPC size by up to 60%.

Closes #2456 from FMX/CELEBORN-1270.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-11 13:50:02 +08:00
SteNicholas
db163bd793 [CELEBORN-1317][FOLLOWUP] Improve parameters, description and document of REST API
### What changes were proposed in this pull request?

Improve parameters, description and document of Celeborn REST API, including:

1. The POST request uses `FormParam` instead of `QueryParam`.
2. The parameter name uses lowercase instead of uppercase.
3. The description of `/exclude` aligns with document in `monitoring.md`.
4. The document of `REST API` adds the `Method` and `Parameters` to document GET/POST method and corresponding interface.

### Why are the changes needed?

The parameters, descriptions and documentation of the REST API need improvement after the HTTP server refinement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2495 from SteNicholas/CELEBORN-1317.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-09 17:41:13 +08:00
Shuang
993d3f2587 [CELEBORN-1398] Support return leader ip to client
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
Currently, if accessing services of a Celeborn cluster across Kubernetes clusters, one may encounter DNS resolution issues. However, connectivity may be achieved through IP addresses when combined with the Kubernetes setting hostNetwork=true for clients from different clusters. At present, the `celeborn.network.bind.preferIpAddress` configuration is only effective on worker nodes. This PR will enable the feature of returning the leader's IP when accessing the master node.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA

Closes #2489 from RexXiong/CELEBORN-1398.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-05-08 15:01:55 +08:00
Shuang
9a9abfe3bc [CELEBORN-1245][FOLLOWUP] Fix SendWorkerEvent in HA mode
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
Handling a worker event used the wrong request.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`RatisMasterStatusSystemSuiteJ#testHandleWorkerEvent`

Closes #2493 from RexXiong/CELEBORN-1245-FOLLOW-UP.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-05-07 15:16:47 +08:00
SteNicholas
2c76a6e429 [CELEBORN-1384] Manually excluding workers should not depend on whether the workers are alive
### What changes were proposed in this pull request?

Manually excluding workers should not depend on whether the workers are alive or not for master.

### Why are the changes needed?

When the workers are offline, master could not add or remove workers through manually excluding workers. Therefore, master should support manually excluding workers no matter whether the workers are alive or not.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #2465 from SteNicholas/CELEBORN-1384.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-04-18 18:09:55 +08:00
Mridul Muralidharan
f27ede42c4
[CELEBORN-1356] Split rpc module into rpc_app and rpc_service
### What changes were proposed in this pull request?

Split the `rpc` transport module into `rpc_app` and `rpc_service` to allow for them to be independently configured.

### Why are the changes needed?

We need the ability to independently configure communication between application components (driver/executors in spark applications) and those to/from Celeborn service (master/workers) components.

This is particularly relevant for TLS support where applications might be running with TLS disabled for their rpc services or using self-signed certificates (see CELEBORN-1354 for an example), while services would have signed certs.

### Does this PR introduce _any_ user-facing change?

Yes, it allows users to independently configure rpc env within the application and those to/from services.
Backward compatibility is maintained - and so existing `rpc` is the fallback in case `rpc_app` or `rpc_service` config is not found.
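
The fallback behavior described above can be sketched as a simple lookup. This is a minimal illustration under assumed names: `RpcModuleConfSketch`, `lookup`, and the `celeborn.<module>.<suffix>` key layout are hypothetical and may not match the real configuration keys.

```scala
// Hypothetical illustration of module-level config fallback; key names are assumptions.
object RpcModuleConfSketch {
  private val splitModules = Set("rpc_app", "rpc_service")

  // Looks up `celeborn.<module>.<suffix>`; for the two split modules,
  // falls back to the generic `rpc` module when no specific value is set.
  def lookup(conf: Map[String, String], module: String, suffix: String): Option[String] =
    conf.get(s"celeborn.$module.$suffix").orElse {
      if (splitModules.contains(module)) conf.get(s"celeborn.rpc.$suffix") else None
    }
}
```

Under this scheme an existing deployment that only sets `rpc` values keeps working, while a deployment that sets an `rpc_service` value overrides just the service side.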

### How was this patch tested?

Unit tests were enhanced, existing tests pass.

Closes #2460 from mridulm/split_rpc_module-retry1.

Lead-authored-by: Mridul Muralidharan <mridul@gmail.com>
Co-authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-04-17 14:59:23 +08:00
Aravind Patnam
f04ebccd4d
[CELEBORN-1368] Log celeborn config for debugging purposes
### What changes were proposed in this pull request?
Log celeborn config for debugging purposes.

### Why are the changes needed?
Help with debugging

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
tested the patch internally.

Closes #2442 from akpatnam25/CELEBORN-1368.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-04-08 15:11:35 +08:00
Chandni Singh
0d72c95958 [CELEBORN-1365] Ensure that a client cannot update the metadata belonging to a different application
### What changes were proposed in this pull request?
This ensures that an authenticated client does not update the metadata belonging to another application.

### Why are the changes needed?
The changes are needed for authentication support.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #2441 from otterc/CELEBORN-1365.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-04-08 10:35:13 +08:00
Mridul Muralidharan
186899f53f
[CELEBORN-1371] Update ratis with internal port endpoint address as well (#2446)
* Update ratis with internal port endpoint address as well, and propagate it to workers, while keeping existing path for applications the same
---------

Co-authored-by: Mridul Muralidharan <mridulatgmail.com>
2024-04-05 14:42:03 -04:00
Mridul Muralidharan
f63adc51e4
[CELEBORN-1370] Exception with authentication is enabled when creating send-application-meta thread pool
### What changes were proposed in this pull request?

Change the initialization order so that `sendApplicationMetaThreads` is initialized before the dispatcher initializes for master.
Currently it ends up being `0` because `onStart` gets called as part of object creation, before `sendApplicationMetaThreads` has been initialized (so it still holds the default value of `0`).
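
The pitfall can be reproduced in isolation. The sketch below uses hypothetical classes (`EndpointSketch`, `BrokenMasterSketch`, `FixedMasterSketch`) rather than the real `Master`, and it uses a `lazy val` as one possible remedy, whereas the PR itself reorders the initialization.

```scala
// Hypothetical reproduction of the initialization-order problem; not Celeborn code.
abstract class EndpointSketch {
  var observedThreads: Int = -1
  def onStart(): Unit
  // Called during base-class construction, before subclass fields are initialized.
  onStart()
}

class BrokenMasterSketch extends EndpointSketch {
  // Assigned only after the base-class constructor has finished.
  val sendApplicationMetaThreads: Int = Integer.parseInt("8")
  override def onStart(): Unit = {
    observedThreads = sendApplicationMetaThreads // reads the JVM default, 0
  }
}

class FixedMasterSketch extends EndpointSketch {
  // A lazy val is computed on first access, so the early call sees the real value.
  lazy val sendApplicationMetaThreads: Int = Integer.parseInt("8")
  override def onStart(): Unit = {
    observedThreads = sendApplicationMetaThreads
  }
}
```

Here `new BrokenMasterSketch().observedThreads` is `0`, while `new FixedMasterSketch().observedThreads` is `8`.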

### Why are the changes needed?

Ensure `sendApplicationMetaExecutor` is created when auth is enabled, and rest of `Master.onStart` completes.

### Does this PR introduce _any_ user-facing change?

No, fixes a bug in master.

### How was this patch tested?

Local deployment, existing unit tests.

Closes #2445 from mridulm/CELEBORN-1370.

Lead-authored-by: Mridul Muralidharan <mridul@gmail.com>
Co-authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-04-05 08:47:58 +08:00
Fei Wang
ceed216a39 [CELEBORN-1317][FOLLOWUP] Retry to setup mini cluster if the cause is BindException
### What changes were proposed in this pull request?
Fix the UT failure caused by the HTTP server port already being in use.

For the Jetty HttpServer, if binding the port fails, the exception is an IOException whose cause is a BindException, so we should retry in that case.

Before:
```
    case e: BindException => // retry to setup mini cluster
```

Now:
```
    case e: IOException
         if e.isInstanceOf[BindException] || Option(e.getCause).exists(
           _.isInstanceOf[BindException]) =>  // retry to setup mini cluster
```
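
The check above can be factored into a small retry helper. This is a self-contained sketch under assumed names (`BindRetrySketch`, `isBindFailure`, `retryOnBindFailure`); the actual mini-cluster setup code is not shown in this description and may be structured differently.

```scala
import java.io.IOException
import java.net.BindException

// Hypothetical retry helper for illustration; not the test utility used in Celeborn.
object BindRetrySketch {
  // True for a BindException itself or an IOException caused by one.
  def isBindFailure(e: Throwable): Boolean = e match {
    case _: BindException => true
    case io: IOException => Option(io.getCause).exists(_.isInstanceOf[BindException])
    case _ => false
  }

  // Runs `setup` with the attempt number, retrying only on bind failures.
  @scala.annotation.tailrec
  def retryOnBindFailure[T](maxAttempts: Int, attempt: Int = 1)(setup: Int => T): T = {
    val result: Either[IOException, T] =
      try Right(setup(attempt))
      catch {
        // Any other exception, or the final failed attempt, propagates to the caller.
        case e: IOException if isBindFailure(e) && attempt < maxAttempts => Left(e)
      }
    result match {
      case Right(value) => value
      case Left(_) => retryOnBindFailure(maxAttempts, attempt + 1)(setup)
    }
  }
}
```

A caller would typically pick a fresh port inside `setup` on each attempt, so that a retry does not hit the same occupied port.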

### Why are the changes needed?

To fix the UT failure caused by the HTTP server port already being in use.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Will trigger GA three times.

Closes #2424 from turboFei/set_connector_stop_timeout.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-28 10:28:47 +08:00
Fei Wang
adbc77cd4f [CELEBORN-1317] Refine celeborn http server and support swagger ui
### What changes were proposed in this pull request?

Before, there was no HTTP request spec such as query params, HTTP method and response mediaType, and each API needed its own HttpEndpoint class.

In this PR, we refine the code for the HTTP service and provide a Swagger UI.

Note: this PR does not change the original API request and response behavior, including the metrics APIs.

TODO:
1. define DTO
2. http request authentication

<img width="1900" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/7f8c2363-170d-4bdf-b2c9-74260e31d3e5">

<img width="1138" alt="image" src="https://github.com/apache/incubator-celeborn/assets/6757692/3ae6ec8e-00a8-475b-bb37-0329536185f6">

### Why are the changes needed?

To close CELEBORN-1317

### Does this PR introduce _any_ user-facing change?

The API is aligned with the previous behavior.

### How was this patch tested?
UT.

Closes #2371 from turboFei/jetty.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-27 23:18:18 +08:00
lvshuang.xjs
9497d557e6
[CELEBORN-1345] Add a limit to the master's estimated partition size
### What changes were proposed in this pull request?
Currently, the Celeborn master calculates the estimatedPartitionSize based on the fileInfo committed by the application. This estimate is then used to allocate slots across all workers. However, this partition size may be too large or too small for Celeborn. For example, if an application commits a single file of 1TB to only one worker, using that partition size could result in all other workers having no available slots or only very small slots. To improve this, it would be better to implement a cap on the master's estimated partition size to prevent such imbalances.
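
A bounded estimate of this kind can be sketched as a clamp. The names `PartitionSizeSketch`, `boundedEstimate`, `minSize`, and `maxSize` are hypothetical, and bounding on both sides is an assumption drawn from the "too large or too small" wording; the PR's actual configuration keys and limits are not shown here.

```scala
// Hypothetical clamp for illustration; the real limits and config names may differ.
object PartitionSizeSketch {
  // Keeps the raw estimate within [minSize, maxSize], all values in bytes.
  def boundedEstimate(rawEstimate: Long, minSize: Long, maxSize: Long): Long = {
    require(minSize <= maxSize, "minSize must not exceed maxSize")
    math.max(minSize, math.min(rawEstimate, maxSize))
  }
}
```

For example, with assumed bounds of 8 MiB and 1 GiB, a single 1 TiB commit would yield an estimate of 1 GiB rather than skewing slot allocation across all workers.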

### Why are the changes needed?
As title

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #2412 from RexXiong/CELEBORN-1345.

Lead-authored-by: lvshuang.xjs <lvshuang.xjs@taobao.com>
Co-authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-25 14:40:47 +08:00
Aravind Patnam
ca64bb5143 [CELEBORN-1313] Custom Network Location Aware Replication
### What changes were proposed in this pull request?

Enable custom network location aware replication, based on a custom impl of `DNSToSwitchMapping`.

### Why are the changes needed?

Resolution of network location of multiple workers at master can be expensive at times. This way, each worker resolves its own network location and sends to master via the RegisterWorker transport message. If worker cannot resolve, fallback to attempting to resolve at master (during update meta or reload of snapshot). Proposal: [Celeborn Custom Network Location Aware Replication](https://docs.google.com/document/d/11M_MKKnIXCTExJHMX-OMTq7SBpkl8fJMlpy8hLgmev0/edit#heading=h.s3vnydz589z5)
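
The worker-first resolution with a master-side fallback can be sketched as below. `NetworkLocationSketch`, `resolve`, and treating an empty string as "unresolved" are assumptions made for illustration; they are not taken from the PR or from `DNSToSwitchMapping`.

```scala
// Hypothetical sketch of the fallback order; not the RegisterWorker handling itself.
object NetworkLocationSketch {
  // Prefers the location the worker reported; otherwise resolves at the master.
  def resolve(
      host: String,
      reportedByWorker: Option[String],
      resolveAtMaster: String => String): String =
    reportedByWorker
      .filter(_.nonEmpty) // assumption: an empty location means the worker could not resolve
      .getOrElse(resolveAtMaster(host))
}
```

The master-side resolver is only invoked when the worker did not supply a location, which avoids the cost of resolving many workers at the master in the common case.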

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Updated the unit tests.

Closes #2367 from akpatnam25/CELEBORN-1313.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-13 11:10:30 +08:00