Commit Graph

602 Commits

Author SHA1 Message Date
SteNicholas
4dfcd9b56b [CELEBORN-1092] Introduce JVM monitoring in Celeborn Worker using JVMQuake
### What changes were proposed in this pull request?

Introduce JVM monitoring in Celeborn Worker using JVMQuake to enable early detection of memory management issues and facilitate fast failure.

### Why are the changes needed?

When memory management in a Celeborn Worker gets out of control, we typically use jvmkill as a remedy: it kills the process and generates a heap dump for post-analysis. However, even with jvmkill protection, we may still encounter issues caused by the JVM running out of memory, such as repeated Full GCs that perform no useful work during the pause time. Because the JVM has not exhausted 100% of its resources, jvmkill is not triggered. JVMQuake is therefore introduced to monitor GC behavior at a finer granularity, enabling early detection of memory management issues and fast failure. It follows the principle of [jvmquake](https://github.com/Netflix-Skunkworks/jvmquake), a JVMTI agent that attaches to your JVM and automatically signals and kills it when the program has become unstable.
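
As a rough illustration of the GC "debt" idea that jvmquake describes, the sketch below adds GC time to a counter and lets useful running time pay it back. It is not Celeborn's `JVMQuake` implementation: the class name `GcDebtTracker`, the runtime weight, and the threshold are all assumptions made for this example.

```java
/** Hypothetical sketch of a GC "debt" counter; not the actual Celeborn JVMQuake class. */
final class GcDebtTracker {
  private final long killThresholdNanos; // assumed threshold chosen by the operator
  private final double runtimeWeight;    // assumed rate at which useful runtime pays debt back
  private long debtNanos = 0L;

  GcDebtTracker(long killThresholdNanos, double runtimeWeight) {
    this.killThresholdNanos = killThresholdNanos;
    this.runtimeWeight = runtimeWeight;
  }

  /** Records one observation window: time spent in GC and time spent running application code. */
  void record(long gcNanos, long runNanos) {
    long paidBack = (long) (runNanos * runtimeWeight);
    // Debt grows with GC time, shrinks with useful runtime, and never drops below zero.
    debtNanos = Math.max(0L, debtNanos + gcNanos - paidBack);
  }

  /** True once accumulated debt exceeds the threshold, i.e. the JVM mostly does GC. */
  boolean shouldKill() {
    return debtNanos > killThresholdNanos;
  }

  long debtNanos() {
    return debtNanos;
  }
}
```

A healthy JVM keeps the debt near zero, while a JVM stuck in back-to-back Full GCs accumulates debt quickly even though it never reports an outright `OutOfMemoryError`.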

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`JVMQuakeSuite`

Closes #2061 from SteNicholas/CELEBORN-1092.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-11-28 20:45:08 +08:00
mingji
113311df3e [CELEBORN-1081][FOLLOWUP] Remove UNKNOWN_DISK and allocate all slots to disk
### What changes were proposed in this pull request?
1. Remove UNKNOWN_DISK from StorageInfo.
2. Enable load-aware slots allocation when there is HDFS.

### Why are the changes needed?
To support the application's config about available storage types.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
GA and Cluster.

Closes #2098 from FMX/B1081-1.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-28 11:26:00 +08:00
Shuang
ad57c8b91e [CELEBORN-1052] Introduce dynamic ConfigService at SystemLevel and TenantLevel
### What changes were proposed in this pull request?
This PR introduces a dynamic ConfigService at SystemLevel and TenantLevel. Dynamic configuration is configuration that can be changed at runtime as needed, and it can be applied at the system level or the tenant level. When dynamic configuration is applied, the priority order is as follows: tenant level overrides system level, which in turn overrides static configuration (CelebornConf). In other words, if a configuration is defined at the tenant level, it is used instead of the system-level or static value. If the tenant-level configuration is missing, the system-level configuration is used. If the system-level configuration is also missing, CelebornConf is used as the default value.
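
The priority order described above (tenant level, then system level, then static `CelebornConf`) can be sketched as a layered lookup. This is only an illustration: `LayeredConfig` and its methods are hypothetical names, and plain maps stand in for the real configuration stores.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of tenant > system > static lookup; not the actual ConfigService API. */
final class LayeredConfig {
  private final Map<String, String> staticConf = new HashMap<>();              // stands in for CelebornConf
  private final Map<String, String> systemConf = new HashMap<>();              // dynamic, system level
  private final Map<String, Map<String, String>> tenantConf = new HashMap<>(); // dynamic, per tenant

  void setStatic(String key, String value) {
    staticConf.put(key, value);
  }

  void setSystem(String key, String value) {
    systemConf.put(key, value);
  }

  void setTenant(String tenant, String key, String value) {
    tenantConf.computeIfAbsent(tenant, t -> new HashMap<>()).put(key, value);
  }

  /** Resolves a key for a tenant, falling through the levels in priority order. */
  Optional<String> get(String tenant, String key) {
    Map<String, String> forTenant = tenantConf.get(tenant);
    String value = forTenant != null ? forTenant.get(key) : null;
    if (value == null) {
      value = systemConf.get(key);
    }
    if (value == null) {
      value = staticConf.get(key);
    }
    return Optional.ofNullable(value);
  }
}
```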

There are several other tasks related to this feature that will be implemented in the future.

- [ ]  [Add isDynamic property for CelebornConf](https://issues.apache.org/jira/browse/CELEBORN-1051)
- [ ]  [Support DB based Configserver](https://issues.apache.org/jira/browse/CELEBORN-1054)
- [ ]  [Add restAPI for configuration management](https://issues.apache.org/jira/browse/CELEBORN-1056)

### Why are the changes needed?
The current server configuration (CelebornConf) is static: when the configuration changes, the service needs to be restarted. This PR introduces a dynamic configuration solution that the server side can use as needed. Because tenant-level features (such as tenant-level dynamic quota control) are expected to be supported in the future, this PR also supports dynamic tenant-level configuration and provides a default implementation based on the file system.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #2100 from RexXiong/CELEBORN-1052.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-27 12:17:05 +08:00
Erik.fang
aee41555c6 [CELEBORN-955] Re-run Spark Stage for Celeborn Shuffle Fetch Failure
### What changes were proposed in this pull request?
Currently, Celeborn uses replication to handle lost shuffle data for the Celeborn shuffle reader. This PR implements an alternative solution based on Spark stage resubmission.

Design doc:
https://docs.google.com/document/d/1dkG6fww3g99VAb1wkphNlUES_MpngVPNg8601chmVp8/edit

### Why are the changes needed?
Spark stage resubmission uses fewer resources than replication, and some Celeborn users are also asking for it.

### Does this PR introduce _any_ user-facing change?
A new config, `celeborn.client.fetch.throwsFetchFailure`, is introduced to enable this feature.

### How was this patch tested?
Two UTs are attached, and we also tested it in Ant Group's dev Spark cluster.

Closes #1924 from ErikFang/Re-run-Spark-Stage-for-Celeborn-Shuffle-Fetch-Failure.

Lead-authored-by: Erik.fang <fmerik@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-26 16:47:58 +08:00
Chandni Singh
788b0c340b [CELEBORN-1135] Added tests for the RpcEnv and related classes
### What changes were proposed in this pull request?
Added test suites for `RpcEnv`, `NettyRpcEnv`, and other related classes.
These are copied over from Apache Spark. Some of the UTs in Apache Spark required changes in the source code like [SPARK-39468](https://issues.apache.org/jira/browse/SPARK-39468) which I didn't copy over.

### Why are the changes needed?
The change adds unit tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Just adds UTs. The source code changes are minimal.

Closes #2107 from otterc/CELEBORN-1135.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-24 09:57:04 +08:00
SteNicholas
60871750e4 [CELEBORN-1136] Support policy for master to assign slots fallback to roundrobin with no available slots
### What changes were proposed in this pull request?

`SlotsAllocator` supports a policy that lets the master fall back to round-robin slot assignment when there are no available slots.

### Why are the changes needed?

When the selected workers have no available slots, the load-aware policy can throw `MasterNotLeaderException`. The master should therefore support falling back to round-robin slot assignment when no slots are available. This situation can occur when the partition size has increased a lot in a short period of time.
```
Caused by: org.apache.celeborn.common.haclient.MasterNotLeaderException: Master:xx.xx.xx.xx:9099 is not the leader. Suggested leader is Master:xx.xx.xx.xx:9099. Exception:bound must be positive.
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HAHelper.sendFailure(HAHelper.java:58)
    at org.apache.celeborn.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:236)
    at org.apache.celeborn.service.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:314)
    ... 7 more
Caused by: java.lang.IllegalArgumentException: bound must be positive
    at java.util.Random.nextInt(Random.java:388)
    at org.apache.celeborn.service.deploy.master.SlotsAllocator.roundRobin(SlotsAllocator.java:202)
    at org.apache.celeborn.service.deploy.master.SlotsAllocator.offerSlotsLoadAware(SlotsAllocator.java:151)
    at org.apache.celeborn.service.deploy.master.Master.$anonfun$handleRequestSlots$1(Master.scala:598)
    at org.apache.celeborn.common.metrics.source.AbstractSource.sample(AbstractSource.scala:199)
    at org.apache.celeborn.common.metrics.source.AbstractSource.sample(AbstractSource.scala:189)
    at org.apache.celeborn.service.deploy.master.Master.handleRequestSlots(Master.scala:587)
    at org.apache.celeborn.service.deploy.master.Master$$anonfun$receiveAndReply$1.$anonfun$applyOrElse$12(Master.scala:314)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.celeborn.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:233)
    ... 8 more
```
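
The `bound must be positive` message in the trace above is what `java.util.Random#nextInt(int)` throws when it is called with a bound of 0, which happens here when the candidate list is empty. The sketch below shows one plausible shape of a guard plus fallback. It is not the actual `SlotsAllocator` code: `SlotFallbackSketch`, its method names, and the choice to fall back to all workers are assumptions.

```java
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of a fallback when no worker has free slots; not the real SlotsAllocator. */
final class SlotFallbackSketch {
  /** Candidates for assignment: workers with free slots, or all workers as a fallback. */
  static List<String> candidates(List<String> allWorkers, List<String> workersWithFreeSlots) {
    return workersWithFreeSlots.isEmpty() ? allWorkers : workersWithFreeSlots;
  }

  /** Random round-robin start index; guards against calling Random#nextInt with a bound of 0. */
  static int startIndex(List<String> candidates, Random random) {
    if (candidates.isEmpty()) {
      throw new IllegalStateException("no workers to assign slots to");
    }
    return random.nextInt(candidates.size());
  }
}
```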

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`SlotsAllocatorSuiteJ#testAllocateSlotsWithNoAvailableSlots`

Closes #2108 from SteNicholas/CELEBORN-1136.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-22 14:08:06 +08:00
SteNicholas
a275b64b32 [CELEBORN-1137] Correct suggested leader of exception message for MasterNotLeaderException
### What changes were proposed in this pull request?

`MasterNotLeaderException` now reports the correct suggested leader in its exception message.

### Why are the changes needed?

When the current peer isn't the master leader and the leader is switching while the cached leader hasn't expired, the exception message in `MasterNotLeaderException` is confusing because it names the current peer as the suggested leader. It's recommended to correct the suggested leader in the exception message when the current peer is equal to the suggested leader.
```
Caused by: org.apache.celeborn.common.haclient.MasterNotLeaderException: Master:xx.xx.xx.xx:9099 is not the leader. Suggested leader is Master:xx.xx.xx.xx:9099. Exception:bound must be positive.
    at org.apache.celeborn.service.deploy.master.clustermeta.ha.HAHelper.sendFailure(HAHelper.java:58)
    at org.apache.celeborn.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:236)
    at org.apache.celeborn.service.deploy.master.Master$$anonfun$receiveAndReply$1.applyOrElse(Master.scala:314)
    ... 7 more
Caused by: java.lang.IllegalArgumentException: bound must be positive
    at java.util.Random.nextInt(Random.java:388)
    at org.apache.celeborn.service.deploy.master.SlotsAllocator.roundRobin(SlotsAllocator.java:202)
    at org.apache.celeborn.service.deploy.master.SlotsAllocator.offerSlotsLoadAware(SlotsAllocator.java:151)
    at org.apache.celeborn.service.deploy.master.Master.$anonfun$handleRequestSlots$1(Master.scala:598)
    at org.apache.celeborn.common.metrics.source.AbstractSource.sample(AbstractSource.scala:199)
    at org.apache.celeborn.common.metrics.source.AbstractSource.sample(AbstractSource.scala:189)
    at org.apache.celeborn.service.deploy.master.Master.handleRequestSlots(Master.scala:587)
    at org.apache.celeborn.service.deploy.master.Master$$anonfun$receiveAndReply$1.$anonfun$applyOrElse$12(Master.scala:314)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.celeborn.service.deploy.master.Master.executeWithLeaderChecker(Master.scala:233)
    ... 8 more
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2109 from SteNicholas/CELEBORN-1137.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-22 10:26:53 +08:00
Fu Chen
aab073ab16 [CELEBORN-1125] Bump guava from 14.0.1 to 32.1.3-jre
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

- Bump Guava from 14.0.1 to 32.1.3-jre.
- Following https://github.com/apache/spark/pull/26911, remove usages of Guava that no longer work in Guava 27/32 and replace them with workalikes. After this PR, Celeborn no longer relies on a specific version of Guava and is compatible with Guava 14/27/32. We can also pin Guava to 27 when running MapReduce integration tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #2090 from cfmcgrady/guava-27.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-21 16:18:14 +08:00
onebox-li
b5c5aa6d9d [CELEBORN-1121] Improve WorkerInfo#hashCode method
### What changes were proposed in this pull request?
Change `WorkerInfo#hashCode()` from `map` + `foldLeft` to a `while` loop and cache the result.

Each way of calculating the hash was tested; the code and results are shown below:
```
val state = Seq(host, rpcPort, pushPort, fetchPort, replicatePort)
// origin
val originHash = state.map(_.hashCode()).foldLeft(0)((a, b) => 31 * a + b)

// for
var forHash = 0
for (i <- state) {
  forHash = 31 * forHash + i.hashCode()
}

// while
var whileHash = 0
var i = 0
while (i < state.size) {
  whileHash = 31 * whileHash + state(i).hashCode()
  i = i + 1
}
```
Result:
```
java version "1.8.0_261"
origin hash result = -831724440, costs 1103914 ns
for hash result = -831724440, costs 444588 ns (2.5x)
while hash result = -831724440, costs 46510 ns (23x)
```
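
The "while and cache" approach can be sketched in Java as follows. This is an illustration only: `WorkerKey` is a hypothetical stand-in for the identity fields of `WorkerInfo`, and using 0 as the "not computed yet" marker is an assumption of this sketch, not necessarily what the PR does.

```java
import java.util.Objects;

/** Hypothetical stand-in for WorkerInfo's identity fields; not the real class. */
final class WorkerKey {
  private final String host;
  private final int rpcPort;
  private final int pushPort;
  private final int fetchPort;
  private final int replicatePort;
  private int cachedHash; // 0 means "not computed yet"; a genuine 0 hash is simply recomputed

  WorkerKey(String host, int rpcPort, int pushPort, int fetchPort, int replicatePort) {
    this.host = Objects.requireNonNull(host);
    this.rpcPort = rpcPort;
    this.pushPort = pushPort;
    this.fetchPort = fetchPort;
    this.replicatePort = replicatePort;
  }

  @Override
  public int hashCode() {
    int h = cachedHash;
    if (h == 0) {
      // Same 31 * a + b fold as the original, written as a plain while loop.
      Object[] state = {host, rpcPort, pushPort, fetchPort, replicatePort};
      int i = 0;
      while (i < state.length) {
        h = 31 * h + state[i].hashCode();
        i++;
      }
      cachedHash = h;
    }
    return h;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof WorkerKey)) return false;
    WorkerKey other = (WorkerKey) o;
    return host.equals(other.host)
        && rpcPort == other.rpcPort
        && pushPort == other.pushPort
        && fetchPort == other.fetchPort
        && replicatePort == other.replicatePort;
  }
}
```

Caching is safe here only because all fields that feed the hash are immutable.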

### Why are the changes needed?
The current `WorkerInfo#hashCode()` is somewhat time-consuming. Since it is widely used in hash maps, it needs to be improved.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UT.

Closes #2086 from onebox-li/improve-worker-hash.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-11-17 10:31:57 +08:00
mingji
02cea042a0 [CELEBORN-1116] Read authentication configs from HADOOP_CONF_DIR
### What changes were proposed in this pull request?
1. Make Celeborn read configs from HADOOP_CONF_DIR.
2. Remove unnecessary Kerberos configs.

### Why are the changes needed?
To support HDFS with Kerberos.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #2082 from FMX/B1116.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Fu Chen <cfmcgrady@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-09 11:07:13 +08:00
Shuang
931880a82d [CELEBORN-1112] Inform celeborn application is shutdown, then celeborn cluster can release resource immediately
### What changes were proposed in this pull request?
Unregister the application from the Celeborn master after the application stops, so that the master expires the related shuffle resources immediately, which saves resources.

### Why are the changes needed?
Currently, the Celeborn master expires an application's shuffle resources only when the application's timeout check fires. This greatly delays the release of resources.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
PASS GA

Closes #2075 from RexXiong/CELEBORN-1112.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-08 20:46:51 +08:00
onebox-li
b7e4dc4339 [CELEBORN-1114] Remove allocationBuckets from WorkerInfo and refactor SLOTS_ALLOCATED metrics
### What changes were proposed in this pull request?
Currently, `WorkerInfo` is used in many places, but `allocationBuckets` is only used when a worker collects the `SLOTS_ALLOCATED` metric for itself. Each `WorkerInfo` maintains an Array\[Int](61), so in an RSS cluster with many workers this wastes a certain amount of memory. This PR removes it from `WorkerInfo`.
It also refactors the `SLOTS_ALLOCATED` metric from a gauge to a counter. Originally, this metric approximated the one-hour total only if tasks ran continuously. Refactoring it into a counter reduces the cost of maintaining time windows, including storage and timely expiration of data. It can also be transformed more flexibly on the Prometheus side according to user needs.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Yes. The `metrics_SlotsAllocated_Count` metric changes from a 1-hour gauge to an increasing counter.

### How was this patch tested?
Cluster test.

Closes #2078 from onebox-li/improve-SlotsAllocated.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-08 19:45:47 +08:00
SteNicholas
d2582919ad [CELEBORN-1110] Support celeborn.worker.storage.disk.reserve.ratio to configure worker reserved ratio for each disk
### What changes were proposed in this pull request?

Support `celeborn.worker.storage.disk.reserve.ratio` to configure worker reserved ratio for each disk.

### Why are the changes needed?

`CelebornConf` already supports configuring the reserved space for each disk of a Celeborn worker as an absolute size. This PR adds `celeborn.worker.storage.disk.reserve.ratio` to configure the reserved ratio for each disk. The minimum usable size for each disk should be the larger of the reserved space and the space calculated from the reserved ratio.
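
The "larger of the two" rule can be written down as a one-line calculation. The sketch below is illustrative: `DiskReserveSketch` is a hypothetical name, and applying the ratio to the disk's total capacity is an assumption, since the description does not say which base the ratio uses.

```java
/** Hypothetical sketch of the reserved-space rule; not the actual Celeborn implementation. */
final class DiskReserveSketch {
  /**
   * Returns the minimum usable size of a disk in bytes: the larger of the absolute
   * reserved size and the size derived from the reserve ratio.
   * Assumption: the ratio is applied to the disk's total capacity.
   */
  static long minimumUsableBytes(long totalBytes, long reservedBytes, double reserveRatio) {
    long reservedByRatio = (long) (totalBytes * reserveRatio);
    return Math.max(reservedBytes, reservedByRatio);
  }
}
```

With a 1000-byte disk, a 50-byte absolute reserve, and a ratio of 0.25, the ratio-derived value (250) wins; with a 400-byte absolute reserve, the absolute value wins.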

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`SlotsAllocatorSuiteJ`

Closes #2071 from SteNicholas/CELEBORN-1110.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-08 12:39:25 +08:00
SteNicholas
52eddc59f3 [CELEBORN-448] Support exclude worker manually
### What changes were proposed in this pull request?

Support excluding a worker manually by worker id. The worker is added to the manually excluded workers.

### Why are the changes needed?

At present, Celeborn's shuffle client excludes workers when a fetch or push fails. Manually excluding a worker is also necessary for maintaining the Celeborn cluster.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `HttpUtilsSuite`
- `DefaultMetaSystemSuiteJ#testHandleWorkerExclude`
- `RatisMasterStatusSystemSuiteJ#testHandleWorkerExclude`
- `MasterStateMachineSuiteJ#testObjSerde`

Closes #1997 from SteNicholas/CELEBORN-448.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-07 16:25:24 +08:00
joey.ljy
455cd40137 [CELEBORN-1111] Supporting connection to HDFS with Kerberos authentication enabled
### What changes were proposed in this pull request?
Adding Kerberos support for HDFS storage type.

The following five parameters need to be configured:
| key | value |
| :--: | :--: |
| celeborn.storage.hdfs.kerberos.enabled | true |
| celeborn.storage.hdfs.kerberos.principal | user@REALM |
| celeborn.storage.hdfs.kerberos.keytab | /path/test.keytab |
| celeborn.hadoop.hadoop.security.authorization | kerberos |
| celeborn.hadoop.dfs.namenode.kerberos.principal | hdfs/_HOST@REALM |

### Why are the changes needed?
Connecting to HDFS with Kerberos enabled requires support for keytab login.

### Does this PR introduce _any_ user-facing change?
Add 3 configurations.
celeborn.storage.hdfs.kerberos.enabled
celeborn.storage.hdfs.kerberos.principal
celeborn.storage.hdfs.kerberos.keytab

### How was this patch tested?
Tested in a Kerberos-enabled HDFS cluster.

Closes #2072 from liujiayi771/hdfs-kerberos.

Authored-by: joey.ljy <joey.ljy@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-04 17:21:41 +08:00
mingji
5e77b851c9 [CELEBORN-1081] Client support celeborn.storage.activeTypes config
### What changes were proposed in this pull request?
1. Support `celeborn.storage.activeTypes` in the client.
2. The master will ignore slots for "UNKNOWN_DISK".

### Why are the changes needed?
Enable the client application to select which storage types to use.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
GA and cluster.

Closes #2045 from FMX/B1081.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-03 20:03:11 +08:00
Chandni Singh
c8b5384baf [CELEBORN-1107] Make the max default number of netty threads configurable
### What changes were proposed in this pull request?
This change makes the maximum default number of Netty threads configurable. Previously, this value was hardcoded to 64, which could be small for certain environments. While it's possible to configure the number of Netty server and client threads individually for each module, providing an option to increase the default value offers greater convenience.
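
A minimal sketch of how such a cap typically works, assuming (as in Apache Spark's Netty utilities) that the default thread count is the number of usable cores limited by a maximum. `NettyThreadsSketch` and its parameter names are hypothetical and do not correspond to actual Celeborn configuration keys.

```java
/** Hypothetical sketch of choosing a Netty thread count; not the actual Celeborn code. */
final class NettyThreadsSketch {
  /**
   * @param configuredThreads per-module setting; values <= 0 mean "use the default"
   * @param availableCores cores usable by the process
   * @param maxDefaultThreads the cap on the default (previously hardcoded to 64)
   */
  static int numThreads(int configuredThreads, int availableCores, int maxDefaultThreads) {
    if (configuredThreads > 0) {
      return configuredThreads; // an explicit per-module setting always wins
    }
    return Math.min(availableCores, maxDefaultThreads);
  }
}
```

On a 128-core host the old cap yields 64 default threads; raising the cap to 128 lets the default follow the core count without touching every module's individual setting.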

### Why are the changes needed?
The change offers convenience.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a UT

Closes #2065 from otterc/CELEBORN-1107.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-03 13:18:44 +08:00
onebox-li
7b185a2562 [CELEBORN-1058] Support specifying the number of dispatcher threads for each role
### What changes were proposed in this pull request?
Support specifying the number of dispatcher threads for each role, especially on the shuffle client side. The shuffle client has only the RpcEndpointVerifier endpoint, which handles few requests, so one thread is enough. The RPC env of the other roles has at most two endpoints, so using a shared event loop is reasonable. I am not sure whether RPC requests will need to be added to the shuffle client, so specific parameters are added here to set the number of dispatcher threads.

Also change the dispatcher thread pool name to distinguish it from Spark's.

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
Yes, add params celeborn.\<role>.rpc.dispatcher.threads

### How was this patch tested?
Manual test and UT

Closes #2003 from onebox-li/my_dev.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-03 10:35:54 +08:00
SteNicholas
4e8e8c2310 [CELEBORN-1094] Optimize mechanism of ChunkManager expired shuffle key cleanup to avoid memory leak
### What changes were proposed in this pull request?

The `cleaner` of `Worker` executes `StorageManager#cleanupExpiredShuffleKey` to clean expired shuffle keys using a daemon cached thread pool. This optimization speeds up the cleanup, including the expired shuffle keys of ChunkManager, to avoid a memory leak.

### Why are the changes needed?

`ChunkManager#streams` could leak memory when cleanup is slower than the rate at which the worker's shuffles expire. The way `ChunkStreamManager` cleans up expired shuffle keys should be optimized to avoid this memory leak, which causes the worker's VM thread to run at 100%.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`WorkerSuite#clean up`.

Closes #2053 from SteNicholas/CELEBORN-1094.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-02 15:46:07 +08:00
SteNicholas
b45b63f9a5 [CELEBORN-247][FOLLOWUP] Add metrics for each user's quota usage of Celeborn Worker
### What changes were proposed in this pull request?

Add the metric `ResourceConsumption` to monitor each user's quota usage of Celeborn Worker.

### Why are the changes needed?

At present, the metric `ResourceConsumption` monitors each user's quota usage on the Celeborn Master only. The usage on Celeborn Workers also needs to be monitored.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2059 from SteNicholas/CELEBORN-247.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-11-01 15:48:31 +08:00
onebox-li
320714bf24 [CELEBORN-1089] Separate overHighWatermark check to a dedicated thread
### What changes were proposed in this pull request?
Separate the `overHighWatermark` check into a dedicated thread so that the value can be shared more easily and the `CongestionController#isUserCongested` logic becomes lighter.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test and UT.

Closes #2041 from onebox-li/congest-check.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-01 09:51:24 +08:00
onebox-li
11fe324a08 [CELEBORN-1093] Improve setup endpoint
### What changes were proposed in this pull request?
Avoid the RpcEndpointAddress -> celebornUrl -> RpcEndpointAddress round trip in setupEndpointRef.

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster test.

Closes #2049 from onebox-li/improve-setup-endpoint.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-31 21:39:59 +08:00
xiyu.zk
2ce8d6fd95 [CELEBORN-1102] Optimize the performance of getAllPrimaryLocationsWithMinEpoch
### What changes were proposed in this pull request?
Optimize the performance of getAllPrimaryLocationsWithMinEpoch

### Why are the changes needed?
#### Before optimization:
![image](https://github.com/apache/incubator-celeborn/assets/107825064/0ccbf503-99b7-45db-a8bd-6539e854d011)

#### After optimization:
![image](https://github.com/apache/incubator-celeborn/assets/107825064/0cb54276-a089-44dc-9b75-6649537515f2)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2058 from kerwin-zk/issue-1102.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
2023-10-31 20:37:17 +08:00
SteNicholas
3092644168 [CELEBORN-1095] Support configuration of fastest available XXHashFactory instance for checksum of Lz4Decompressor
### What changes were proposed in this pull request?

`CelebornConf` adds `celeborn.client.shuffle.decompression.lz4.xxhash.instance` to configure which `XXHashFactory` instance is used for the checksum of `Lz4Decompressor`. Fix #2043.

### Why are the changes needed?

`Lz4Decompressor` creates the checksum with `XXHashFactory#fastestInstance`, which returns the fastest available `XXHashFactory` instance and uses `nativeInstance` by default. The `XXHashFactory` instance for the checksum of `Lz4Decompressor` should be configurable rather than depend on whether the class loader is the system class loader. The method is as follows:
```
/**
 * Returns the fastest available {@link XXHashFactory} instance. If the class
 * loader is the system class loader and if the
 * {@link #nativeInstance() native instance} loads successfully, then the
 * {@link #nativeInstance() native instance} is returned, otherwise the
 * {@link #fastestJavaInstance() fastest Java instance} is returned.
 * <p>
 * Please read {@link #nativeInstance() javadocs of nativeInstance()} before
 * using this method.
 *
 * @return the fastest available {@link XXHashFactory} instance.
 */
public static XXHashFactory fastestInstance() {
  if (Native.isLoaded()
      || Native.class.getClassLoader() == ClassLoader.getSystemClassLoader()) {
    try {
      return nativeInstance();
    } catch (Throwable t) {
      return fastestJavaInstance();
    }
  } else {
    return fastestJavaInstance();
  }
}
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `CelebornConfSuite`
- `ConfigurationSuite`

Closes #2050 from SteNicholas/CELEBORN-1095.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
2023-10-31 14:57:31 +08:00
xiyu.zk
cf194a5e3a [CELEBORN-1097] Optimize the retrieval of configuration in the internalCreateClient
### What changes were proposed in this pull request?
Optimize the retrieval of configuration in the internalCreateClient

### Why are the changes needed?
Directly accessing configuration information through 'conf.xx' in 'internalCreateClient' is time-consuming.
![image](https://github.com/apache/incubator-celeborn/assets/107825064/315a5013-5dfb-4a44-bf1b-109fc4ecc654)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2055 from kerwin-zk/client-factory-conf.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-31 09:56:34 +08:00
SteNicholas
df40a28959 [CELEBORN-1032][FOLLOWUP] Use scheduleWithFixedDelay instead of scheduleAtFixedRate in threads pool of master and worker
### What changes were proposed in this pull request?

Use `scheduleWithFixedDelay` instead of `scheduleAtFixedRate` in thread pool of Celeborn Master and Worker.

### Why are the changes needed?

Follow up #1970.
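
For context, `scheduleAtFixedRate` measures the period from the start of each run, so a slow task can be followed immediately by the next run, whereas `scheduleWithFixedDelay` always waits the full delay after a run finishes. The sketch below measures the idle gap between consecutive runs under fixed delay; `FixedDelaySketch` and its parameters are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo of scheduleWithFixedDelay; not Celeborn code. */
final class FixedDelaySketch {
  /** Runs a slow task `runs` times and returns the idle gaps (ns) between consecutive runs. */
  static List<Long> idleGapsNanos(long delayMillis, long taskMillis, int runs) {
    ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
    List<Long> starts = new ArrayList<>();
    List<Long> ends = new ArrayList<>();
    CountDownLatch done = new CountDownLatch(runs);
    pool.scheduleWithFixedDelay(() -> {
      if (done.getCount() == 0) {
        return; // enough samples collected; ignore any extra run
      }
      starts.add(System.nanoTime());
      try {
        Thread.sleep(taskMillis); // simulate a task that takes longer than the delay
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      ends.add(System.nanoTime());
      done.countDown();
    }, 0, delayMillis, TimeUnit.MILLISECONDS);
    try {
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
    }
    List<Long> gaps = new ArrayList<>();
    for (int i = 1; i < runs; i++) {
      gaps.add(starts.get(i) - ends.get(i - 1)); // time between one run ending and the next starting
    }
    return gaps;
  }
}
```

Every measured gap is at least the configured delay, no matter how long the task itself takes, which prevents a backlog of runs from piling up behind a slow task.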

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #2048 from SteNicholas/CELEBORN-1032.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-27 11:20:28 +08:00
sychen
5d3ae318bf [CELEBORN-665][FOLLOWUP] Skip empty app snapshot logs
### What changes were proposed in this pull request?

### Why are the changes needed?
`topNItems` is never empty, but may be all null.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2046 from cxzl25/CELEBORN-665_FOLLOWUP.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-27 09:50:43 +08:00
Fu Chen
447c243601 [CELEBORN-1075] Refactor MetricsSystem and AbstractSource to use synchronized blocks
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

Recently, while testing on the main branch, we encountered a `java.util.ConcurrentModificationException`. This PR addresses synchronization issues in the MetricsSystem and AbstractSource classes by introducing synchronized blocks to ensure thread safety.

1. The `MetricsSystem#sources` collection has been changed from `mutable.ArrayBuffer` to `CopyOnWriteArrayList` to prevent potential thread-safety issues.
2. The `AbstractSource#namedGauges` collection has been changed from `ArrayList` to `ConcurrentLinkedQueue` to enhance thread safety when adding gauges. This fixes:

```
java.util.ConcurrentModificationException
        at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:911)
        at java.util.ArrayList$Itr.next(ArrayList.java:861)
        at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
        at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
        at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
        at scala.collection.TraversableLike.to(TraversableLike.scala:786)
        at scala.collection.TraversableLike.to$(TraversableLike.scala:783)
        at scala.collection.AbstractTraversable.to(Traversable.scala:108)
        at scala.collection.TraversableOnce.toList(TraversableOnce.scala:350)
        at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:350)
        at scala.collection.AbstractTraversable.toList(Traversable.scala:108)
        at org.apache.celeborn.common.metrics.source.AbstractSource.gauges(AbstractSource.scala:146)
        at org.apache.celeborn.common.metrics.source.AbstractSource.getMetrics(AbstractSource.scala:401)
        at org.apache.celeborn.common.metrics.sink.PrometheusServlet.$anonfun$getMetricsSnapshot$1(PrometheusServlet.scala:42)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at org.apache.celeborn.common.metrics.sink.PrometheusServlet.getMetricsSnapshot(PrometheusServlet.scala:42)
        at org.apache.celeborn.common.metrics.sink.PrometheusHttpRequestHandler.handleRequest(PrometheusServlet.scala:59)
        at org.apache.celeborn.server.common.http.HttpRequestHandler.channelRead0(HttpRequestHandler.scala:53)
        at org.apache.celeborn.server.common.http.HttpRequestHandler.channelRead0(HttpRequestHandler.scala:37)
```
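
The trace above is the usual signature of an `ArrayList` being structurally modified while it is iterated. As a minimal sketch (the class and method names below are hypothetical and are not Celeborn's `AbstractSource`), the following reproduces the failure on a single thread and shows one way to avoid it: guard writes with a lock and hand readers a copy taken under the same lock.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical stand-in for a metrics source; not Celeborn's actual class.
class GaugeRegistrySketch {
    private final List<String> gauges = new ArrayList<>();

    // Writers and readers share the same monitor.
    synchronized void addGauge(String name) {
        gauges.add(name);
    }

    // Copy while holding the lock, so callers iterate over a private snapshot.
    synchronized List<String> snapshot() {
        return new ArrayList<>(gauges);
    }

    // Modifying an ArrayList during iteration trips its fail-fast iterator.
    static boolean unsafeIterationThrows() {
        List<String> list = new ArrayList<>(Arrays.asList("a", "b", "c"));
        try {
            for (String name : list) {
                if (name.equals("a")) {
                    list.add("d");
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(unsafeIterationThrows());
        GaugeRegistrySketch registry = new GaugeRegistrySketch();
        registry.addGauge("g1");
        for (String name : registry.snapshot()) {
            registry.addGauge(name + "-copy"); // safe: we iterate the snapshot
        }
        System.out.println(registry.snapshot());
    }
}
```

Copying under the lock keeps the critical section short; whether the actual patch copies or iterates inside the synchronized block is not stated here.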

### Does this PR introduce _any_ user-facing change?

No, this is only a bug fix.

### How was this patch tested?

Pass GA

Closes #2023 from cfmcgrady/synchronized-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-24 21:57:00 +08:00
Fu Chen
349ee8b1cb Revert "[CELEBORN-255] Add counter of outstandingFetches, outstandingRpcs and outstandingPushes to metrics"

This reverts commit bfa341c32f.

### What changes were proposed in this pull request?

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1992#issuecomment-1776760369

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2032 from cfmcgrady/revert-pr-1992.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-24 17:18:54 +08:00
SteNicholas
49ea881037 [MINOR] Remove unnecessary index increment in Master#timeoutDeadWorkers
### What changes were proposed in this pull request?

Remove the unnecessary index increment in `Master#timeoutDeadWorkers`.

### Why are the changes needed?

The index increment in `Master#timeoutDeadWorkers` is unnecessary.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2027 from SteNicholas/timeout-dead-workers.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-23 22:18:39 +08:00
Mridul Muralidharan
eb382b018c [CELEBORN-1072] Fix misc Error Prone reports
Fix misc Error Prone reports.
As detailed in the JIRA, they are:
* Reference equality of boxed primitive types. See [BoxedPrimitiveEquality](https://errorprone.info/bugpattern/BoxedPrimitiveEquality)
* Calling `run` directly. Since the use is legitimate, the check is marked as ignored. See [DoNotCall](https://errorprone.info/bugpattern/DoNotCall)
* `Ignore` the test instead of catching `AssertionError` and ignoring it. See [AssertionFailureIgnored](https://errorprone.info/bugpattern/AssertionFailureIgnored)
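
To illustrate the first item, here is a minimal sketch (the class name is hypothetical, not taken from the patch) of comparing boxed values by value rather than by reference:

```java
import java.util.Objects;

// Hypothetical illustration of the BoxedPrimitiveEquality pattern.
class BoxedEqualitySketch {
    // Null-safe comparison by value; `==` on boxed types compares references.
    static boolean sameValue(Long a, Long b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        Long a = Long.valueOf(100_000L);
        Long b = Long.valueOf(100_000L);
        // `a == b` is not reliable here: values outside the small boxing
        // cache are typically distinct objects.
        System.out.println(sameValue(a, b));
    }
}
```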

Fix misc error prone reports.

No

Unit tests

Closes #2019 from mridulm/fix-misc-issues.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-23 11:15:10 +08:00
Mridul Muralidharan
5fb3680b13 [CELEBORN-1068] Fix hashCode computation
The `hashCode` of an array does not hash its contents; it is based only on the identity of the array reference.
This was identified while enabling Error Prone (see #2016).
See more [here](https://errorprone.info/bugpattern/ArrayHashCode)
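
A minimal sketch of the pattern (the `ArrayKeySketch` class is hypothetical and only illustrates the idea): use `java.util.Arrays.hashCode` and `Arrays.equals` so that two keys with the same contents hash and compare alike.

```java
import java.util.Arrays;

// Hypothetical key type that wraps a byte array.
class ArrayKeySketch {
    private final byte[] data;

    ArrayKeySketch(byte[] data) {
        this.data = data.clone();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ArrayKeySketch
            && Arrays.equals(data, ((ArrayKeySketch) o).data);
    }

    @Override
    public int hashCode() {
        // data.hashCode() would be identity-based; hash the contents instead.
        return Arrays.hashCode(data);
    }

    public static void main(String[] args) {
        ArrayKeySketch k1 = new ArrayKeySketch(new byte[] {1, 2, 3});
        ArrayKeySketch k2 = new ArrayKeySketch(new byte[] {1, 2, 3});
        System.out.println(k1.equals(k2) && k1.hashCode() == k2.hashCode());
    }
}
```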

Fix bug with `hashCode` computation as identified by error-prone

No

Existing unit tests

Closes #2017 from mridulm/fix-hashcode-computation.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-23 11:14:43 +08:00
onebox-li
f7783249f5 [MINOR] Fix ShutdownHookManager#shutdownExecutor log's unit
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test

Closes #2004 from onebox-li/fix-shutdown-log.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-19 22:30:47 +08:00
jiaoqingbo
7456d9a0d2 [MINOR] Delete redundant Loggers
### What changes were proposed in this pull request?

As Title

### Why are the changes needed?

As Title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

PASS GA

Closes #2001 from jiaoqingbo/minor.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-19 18:47:45 +08:00
Fu Chen
8bf7e5259d [CELEBORN-1047] Remove conf celeborn.worker.sortPartition.eagerlyRemoveOriginalFiles.enabled
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

The config key `celeborn.worker.sortPartition.eagerlyRemoveOriginalFiles.enabled` has become unnecessary as a result of #1932

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1999 from cfmcgrady/celeborn-1047.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-18 16:08:42 +08:00
mingji
69defcad7f [CELEBORN-1021] Support arbitrary Ratis configs and client RPC timeout
### What changes were proposed in this pull request?
1. To support arbitrary Ratis configs
2. To support Ratis client rpc timeout

### Why are the changes needed?
After some digging, I found that Celeborn never changed the default timeout config of the Ratis client.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster.

Closes #1969 from FMX/CELEBORN-1021.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-10-18 10:26:11 +08:00
sunjunjie
03498ce46b [CELEBORN-1046] Add an expiration time configuration for app directory to clean up
### What changes were proposed in this pull request?
Add a worker configuration, `celeborn.worker.storage.expireDirs.timeout`, with a default value of 6h. It sets the expiration time for app local directories.

https://issues.apache.org/jira/browse/CELEBORN-1046
### Why are the changes needed?
When Celeborn periodically deletes the directories of apps, it determines whether the app needs to be deleted based on the shuffleKeySet in memory. However, this method may not accurately indicate the completion of the app and could potentially lead to the unintentional deletion of shuffle data.
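
A minimal sketch of a time-based expiry check, assuming the directory's last-modified time is the age signal (the source does not say which timestamp is used; the class and method names are hypothetical):

```java
import java.time.Duration;

// Hypothetical helper; not Celeborn's actual cleanup code.
class ExpireDirsSketch {
    // A directory counts as expired only after it has been untouched for
    // longer than the configured timeout.
    static boolean isExpired(long lastModifiedMillis, long nowMillis, Duration timeout) {
        return nowMillis - lastModifiedMillis > timeout.toMillis();
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofHours(6); // default value named in the PR
        long now = System.currentTimeMillis();
        long sevenHoursAgo = now - Duration.ofHours(7).toMillis();
        System.out.println(isExpired(sevenHoursAgo, now, timeout));
    }
}
```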

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1998 from wilsonjie/CELEBORN-1046.

Authored-by: sunjunjie <sunjunjie@zto.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-17 19:23:49 +08:00
jiaoqingbo
efc36ebdba [CELEBORN-1043] Convert variable ‘metric’ from String to StringBuilder in toMetric method
### What changes were proposed in this pull request?

As Title

### Why are the changes needed?

As Title

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

PASS GA

Closes #1996 from jiaoqingbo/1043.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-17 16:51:29 +08:00
SteNicholas
9244cf2cf2 [CELEBORN-772] Convert StreamChunkSlice, ChunkFetchRequest, TransportableError to PB
### What changes were proposed in this pull request?

`StreamChunkSlice`, `ChunkFetchRequest` and `TransportableError` should be merged into transport messages to improve Celeborn's compatibility.

### Why are the changes needed?

1. Improves the flexibility of Celeborn's transport layer when the RPC changes.
2. Keeps compatibility with the 0.2 client.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `FetchHandlerSuiteJ`
- `RequestTimeoutIntegrationSuiteJ`
- `ChunkFetchIntegrationSuiteJ`

Closes #1982 from SteNicholas/CELEBORN-772.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-10-17 11:12:01 +08:00
SteNicholas
bfa341c32f [CELEBORN-255] Add counter of outstandingFetches, outstandingRpcs and outstandingPushes to metrics
### What changes were proposed in this pull request?

Add counter of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` to metrics of Celeborn Worker.

### Why are the changes needed?

Exposing the counters of `outstandingFetches`, `outstandingRpcs` and `outstandingPushes` of `TransportResponseHandler` as metrics makes it possible to monitor outstanding fetches, RPCs and pushes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`TransportResponseHandlerSuiteJ`

Closes #1992 from SteNicholas/CELEBORN-255.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 21:16:57 +08:00
sychen
a8ac18f2e8 [CELEBORN-299] Deprecate celeborn.worker.storage.baseDir.prefix and celeborn.worker.storage.baseDir.number
### What changes were proposed in this pull request?

<img width="1460" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/ac3b29be-7c39-4c18-b71d-0e243797273e">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
23/10/16 03:31:13,399 WARN [pool-1-thread-1-ScalaTest-running-CelebornConfSuite] CelebornConf: The configuration key 'celeborn.worker.storage.baseDir.prefix' has been deprecated in v0.4.0 and may be removed in the future. Please use celeborn.worker.storage.dirs
23/10/16 03:31:13,399 WARN [pool-1-thread-1-ScalaTest-running-CelebornConfSuite] CelebornConf: The configuration key 'celeborn.worker.storage.baseDir.number' has been deprecated in v0.4.0 and may be removed in the future. Please use celeborn.worker.storage.dirs
```

Closes #1993 from cxzl25/CELEBORN-299.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 19:10:13 +08:00
onebox-li
9a90ac8b6d [CELEBORN-1036] Map task hangs at limitZeroInFlight due to duplicate onFailure called
### What changes were proposed in this pull request?
In our test jobs, we found that a few map tasks may hang at InFlightRequestTracker#limitZeroInFlight (this can occur in both prepareForMergeData and mapEndInternal) when a worker shuts down unexpectedly. We added logs to trace InFlightRequestTracker#totalInflightReqs and found that this adder may become negative in that case.

When a worker suddenly shuts down, the channel connection raises an exception.
If NioEventLoop.processSelectedKeys is reading at that moment, exceptionCaught is called. TransportResponseHandler#exceptionCaught then calls failOutstandingRequests, which invokes each request's onFailure callback.
```
WARN [data-client-5-9] TransportChannelHandler: Exception in connection from /xx
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.celeborn.shaded.io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:256)
	at org.apache.celeborn.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
	at org.apache.celeborn.shaded.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:357)
	at org.apache.celeborn.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at org.apache.celeborn.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)
ERROR [data-client-5-9] ShuffleClientImpl: Push data to xx failed for shuffle 0 map 11 attempt 0 partition 178 batch 634, remain revive times 5.
```
Next, NioEventLoop starts `runAllTasks` in the finally block. If a push write task is pending, PushChannelListener.handleFailure is called because the channel is closing. Here callback.onFailure may race on `outstandingPushes`.
```
ERROR [data-client-5-9] ShuffleClientImpl: Push data to xx failed for shuffle 0 map 11 attempt 0 partition 178 batch 634, remain revive times 4.
org.apache.celeborn.common.exception.CelebornIOException: Failed to send request PUSH 1264 to /xx: org.apache.celeborn.shaded.io.netty.channel.StacklessClosedChannelException, channel will be closed
	at org.apache.celeborn.common.network.client.TransportClient$PushChannelListener.handleFailure(TransportClient.java:382)
	at org.apache.celeborn.common.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:325)
	at org.apache.celeborn.common.network.client.TransportClient$PushChannelListener.operationComplete(TransportClient.java:373)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:557)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:999)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:860)
	at org.apache.celeborn.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:877)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:863)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:968)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:856)
	at org.apache.celeborn.shaded.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:113)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:881)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:863)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:968)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:856)
	at org.apache.celeborn.shaded.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:879)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:940)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1247)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at org.apache.celeborn.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.celeborn.shaded.io.netty.channel.StacklessClosedChannelException
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
```

A duplicate callback.onFailure call makes the totalInflightReqs count incorrect.

The race is not severe and only occurs in exceptional situations, so I think synchronizing on a lock is enough to avoid it.
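
The idea can be sketched as follows. This is a simplified illustration with hypothetical names, not the actual `TransportResponseHandler` code: the entry is removed from the outstanding map under a lock, and only the caller that actually removed it runs the failure callback, so the in-flight counter is decremented at most once per request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of an at-most-once failure callback.
class OutstandingPushesSketch {
    private final Map<Long, Runnable> outstandingPushes = new HashMap<>();
    final AtomicInteger totalInflightReqs = new AtomicInteger();

    void addPush(long requestId) {
        totalInflightReqs.incrementAndGet();
        synchronized (this) {
            // The callback undoes the increment when the push fails.
            outstandingPushes.put(requestId, totalInflightReqs::decrementAndGet);
        }
    }

    // May be reached from both the exception path and the write-failure path.
    void failPush(long requestId) {
        Runnable onFailure;
        synchronized (this) {
            onFailure = outstandingPushes.remove(requestId);
        }
        if (onFailure != null) {
            onFailure.run(); // runs only for the caller that removed the entry
        }
    }

    public static void main(String[] args) {
        OutstandingPushesSketch handler = new OutstandingPushesSketch();
        handler.addPush(1264L);
        handler.failPush(1264L); // first failure notification
        handler.failPush(1264L); // duplicate notification is a no-op
        System.out.println(handler.totalInflightReqs.get());
    }
}
```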

### Why are the changes needed?
Increase robustness.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1978 from onebox-li/fix-handle-channel-failure.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 20:13:32 +08:00
sychen
dd65e74f99 [CELEBORN-983] Rename PrometheusMetric configuration
### What changes were proposed in this pull request?
Replace
```properties
celeborn.metrics.master.prometheus.host
celeborn.metrics.master.prometheus.port
celeborn.metrics.worker.prometheus.host
celeborn.metrics.worker.prometheus.port
```
With
```properties
celeborn.master.http.host
celeborn.master.http.port
celeborn.worker.http.host
celeborn.worker.http.port
```

### Why are the changes needed?
The ports bound by `celeborn.metrics.master.prometheus.port` and `celeborn.metrics.worker.prometheus.port` serve not only Prometheus metrics but also some useful API services.

https://celeborn.apache.org/docs/latest/monitoring/#rest-api

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1919 from cxzl25/CELEBORN-983.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 13:28:58 +08:00
onebox-li
8b1bd07905 [CELEBORN-1037] Incorrect output for metrics of Prometheus
### What changes were proposed in this pull request?
The newly added `deadlocks` metric in `ThreadStatesGaugeSet` is a `Set<String>`, which is not a valid metric value. This change adds a filter at the `addGauge` entrance.
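
A minimal sketch of such a filter, assuming it probes the gauge's current value and accepts only numbers (the source does not show how the check is implemented; the class below is hypothetical):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical registry that rejects gauges whose value is not numeric.
class NumericGaugeFilterSketch {
    private final Map<String, Supplier<?>> gauges = new LinkedHashMap<>();

    // Returns true if the gauge was registered, false if it was filtered out.
    boolean addGauge(String name, Supplier<?> gauge) {
        if (!(gauge.get() instanceof Number)) {
            return false; // e.g. a Set<String> of deadlocked thread names
        }
        gauges.put(name, gauge);
        return true;
    }

    int size() {
        return gauges.size();
    }

    public static void main(String[] args) {
        NumericGaugeFilterSketch registry = new NumericGaugeFilterSketch();
        System.out.println(registry.addGauge("deadlock.count", () -> 0));
        System.out.println(registry.addGauge("deadlocks", () -> Collections.emptySet()));
    }
}
```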

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
Yes, the unused metric is removed. Note that the template uses `metrics_jvm_thread_deadlock_count_Value`.

### How was this patch tested?
Manual test

Closes #1981 from onebox-li/fix-1037.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-13 11:18:03 +08:00
sychen
61fadd57bd [CELEBORN-665] Skip empty app snapshot logs
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1973 from cxzl25/CELEBORN-665.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-12 21:12:15 +08:00
onebox-li
a47f6169d8 [MINOR] Fix some typos
### What changes were proposed in this pull request?
Fix some typos

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
-

Closes #1983 from onebox-li/fix-typo.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-12 20:34:07 +08:00
SteNicholas
84a318f716 [CELEBORN-1033] MasterNotLeaderException should provide the cause of exception
### What changes were proposed in this pull request?

`HAHelper#sendFailure` sends `MasterNotLeaderException` without a cause, so the underlying exception is not available for troubleshooting.

### Why are the changes needed?

`MasterNotLeaderException` should carry the cause of the exception.
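
As a generic illustration of cause chaining in Java (the exception class below is hypothetical and does not mirror the real `MasterNotLeaderException` constructor), passing the original throwable to the superclass keeps it reachable through `getCause()` and in stack traces:

```java
import java.io.IOException;

// Hypothetical exception type used only to show cause chaining.
class NotLeaderSketchException extends IOException {
    NotLeaderSketchException(String message, Throwable cause) {
        super(message, cause); // preserves the underlying failure
    }

    public static void main(String[] args) {
        Throwable root = new IllegalStateException("raft log not ready");
        NotLeaderSketchException e =
            new NotLeaderSketchException("current master is not the leader", root);
        System.out.println(e.getCause().getMessage());
    }
}
```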

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`MasterClientSuiteJ`

Closes #1972 from SteNicholas/CELEBORN-1033.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-10-11 20:18:58 +08:00
sychen
9c07ceddb0 [CELEBORN-1028][FOLLOWUP][DOCS] Make prometheus path configurable
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/1965#issuecomment-1755345813

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

<img width="1410" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/6454133a-040b-4dde-84b7-dbf08fb15b13">

<img width="1401" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/3cdfa9f2-9a7a-43cb-9006-77810a350669">

Closes #1974 from cxzl25/CELEBORN-1028-FOLLOWUP.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 22:59:22 +08:00
sychen
bcf89da7dd [MINOR] Fix typo in CelebornConf
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1971 from cxzl25/typo.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 20:04:16 +08:00
sychen
f6d27609b8 [CELEBORN-1028] Make prometheus path configurable
### What changes were proposed in this pull request?
`celeborn.metrics.prometheus.path`

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1965 from cxzl25/CELEBORN-1028.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 18:37:44 +08:00