1656 Commits

Author SHA1 Message Date
Mridul Muralidharan
21d5698a90 [CELEBORN-1339] Mark connection as timedOut in TransportClient.close
### What changes were proposed in this pull request?

Importing details from https://github.com/apache/spark/pull/43162:

--
This PR avoids a race condition where a connection which is in the process of being closed could be returned by the TransportClientFactory only to be immediately closed and cause errors upon use.

This race condition is rare and not easily triggered, but with the upcoming changes to introduce SSL connection support, connection closing can take just a slight bit longer and it's much easier to trigger this issue.

Looking at the history of the code I believe this was an oversight in https://github.com/apache/spark/pull/9853.

--
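The race described above can be sketched as follows. This is a minimal illustration, not Celeborn's or Spark's actual code: `SketchClient`, `SketchClientPool`, `beginClose`, and `finishClose` are hypothetical names, and the close is split into two steps only so the in-between state is visible.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a pooled client; not the real TransportClient.
class SketchClient {
  private final AtomicBoolean timedOut = new AtomicBoolean(false);
  private final AtomicBoolean channelOpen = new AtomicBoolean(true);

  // A client is reusable only if it is not marked timed out and its channel is open.
  boolean isActive() {
    return !timedOut.get() && channelOpen.get();
  }

  // Step 1 of close: mark the client so the pool stops handing it out.
  void beginClose() {
    timedOut.set(true);
  }

  // Step 2 of close: the (possibly slow) channel shutdown.
  void finishClose() {
    channelOpen.set(false);
  }

  void close() {
    beginClose();
    finishClose();
  }
}

// Hypothetical single-slot pool; a real factory would key clients by address.
class SketchClientPool {
  private SketchClient cached;

  synchronized SketchClient get() {
    if (cached == null || !cached.isActive()) {
      cached = new SketchClient();
    }
    return cached;
  }
}
```

Because `beginClose` runs before the channel shutdown, a caller that asks the pool for a client while the shutdown is still in progress receives a fresh client rather than the one being closed.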

### Why are the changes needed?

We are working towards adding TLS support, which is essentially based on Spark 4.0 TLS support, and this is one of the fixes from there.
(I have yet to file the overall TLS support JIRA, but this is enabling work.)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #2400 from mridulm/add-SPARK-45375.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-20 08:50:14 +08:00
sychen
9938143736 [MINOR] Fix typo in TransportClient
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2406 from cxzl25/TransportClient_typo.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-20 08:41:59 +08:00
MiyueSC
4562320626 [CELEBORN-1305][FOLLOWUP] Unify application module naming
### What changes were proposed in this pull request?

Unify application module naming.

### Why are the changes needed?

Unify application module naming.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test.

Closes #2405 from miyuesc/fix-naming.

Authored-by: MiyueSC <913784771@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-19 21:09:30 +08:00
MiyueSC
0f42d59472 [CELEBORN-1304] Introduce tenant module for dashboard frontend
### What changes were proposed in this pull request?

Tenant list:

![image](https://github.com/apache/incubator-celeborn/assets/50617660/405b6d42-8eb7-4f7b-95b8-648cafa4b1eb)

Tenant detail:

![image](https://github.com/apache/incubator-celeborn/assets/50617660/5d702dce-6a4a-488e-bf62-12df6e490f8e)

### Why are the changes needed?

The tenant module provides Celeborn with information about the tenant.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test.

Closes #2402 from miyuesc/CELEBORN-1304.

Authored-by: MiyueSC <913784771@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-19 20:39:51 +08:00
MiyueSC
2246b33728 [CELEBORN-1305][FOLLOWUP] Unify application module naming
### What changes were proposed in this pull request?

Unify application module naming.

### Why are the changes needed?

Keep the file name style of each module unified.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test.

Closes #2403 from miyuesc/CELEBORN-1305-followup.

Authored-by: MiyueSC <913784771@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-19 18:04:28 +08:00
tiny-dust
ab252c2cf7 [CELEBORN-1306] Introduce master module for dashboard frontend
### What changes were proposed in this pull request?

Master list:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/bb440a75-eaf3-4386-b8da-5f0b39d5ed23)

Master Detail:
 Workers:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/1cf30db1-51f0-45d1-a587-eafd2cedfa6e)
 Application:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/0d1eccd6-6329-47d1-acac-5dbd58a461f8)
 Configuration:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/8db518af-e34a-4dc9-8148-568ff9067cae)
 FlameGraph:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/3224ab21-75a1-4f82-b2b9-81eba096611e)
 ThreadDump:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/c0fb017e-382f-4da7-88e3-949be6a5ce03)
 Metrics:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/b053c802-b952-4edf-a90c-7dae5ef2e53b)
 Logs:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/9edfb9da-5de4-4b88-9df2-4a98b697f51a)
 LogList:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/ff53b1c6-695f-4e95-a6fc-d85884cfffda)

### Why are the changes needed?

The Master module provides Celeborn with information about the master.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test.

Closes #2401 from tiny-dust/CELEBORN-1306.

Authored-by: tiny-dust <idioticzhou@foxmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-19 15:01:33 +08:00
sychen
91f6378682 [CELEBORN-1336] Remove client partition split pool
### What changes were proposed in this pull request?

### Why are the changes needed?
`CELEBORN-1320` uses `ReviveManager` to batch the RPCs for SOFT_SPLIT events, so `partitionSplitPool` is no longer used and the configuration item `celeborn.client.push.splitPartition.threads` is meaningless.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2396 from cxzl25/CELEBORN-1336.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-18 21:48:59 +08:00
ForVic
15b0f16f74 [MINOR] Fix typo in developer docs - overview
### What changes were proposed in this pull request?
To fix a typo.

### Why are the changes needed?
To maintain the quality of Celeborn documentation.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
N/A

Closes #2397 from ForVic/forvic/fix_typo.

Authored-by: ForVic <victor.lakers0@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-17 16:15:16 +08:00
SteNicholas
12c3779805 [CELEBORN-1330] Bump rocksdbjni version from 8.5.3 to 8.11.3
### What changes were proposed in this pull request?

Bump `rocksdbjni` version from 8.5.3 to 8.11.3.

### Why are the changes needed?

The new version brings some bug fixes:

- Fix a corner case with auto_readahead_size where Prev Operation returns NOT SUPPORTED error when scans direction is changed from forward to backward.
- Avoid destroying the periodic task scheduler's default timer in order to prevent static destruction order issues.
- Fix double counting of BYTES_WRITTEN ticker when doing writes with transactions.
- Fix a WRITE_STALL counter that was reporting wrong value in few cases.
- A lookup by MultiGet in a TieredCache that goes to the local flash cache and finishes with very low latency, i.e before the subsequent call to WaitAll, is ignored, resulting in a false negative and a memory leak.
- Fix bug in auto_readahead_size that combined with IndexType::kBinarySearchWithFirstKey + fails or iterator lands at a wrong key
- Fixed some cases in which DB file corruption was detected but ignored on creating a backup with BackupEngine.
- Fix bugs where rocksdb.blobdb.blob.file.synced includes blob files failed to get synced and rocksdb.blobdb.blob.file.bytes.written includes blob bytes failed to get written.
- Fixed a possible memory leak or crash on a failure (such as I/O error) in automatic atomic flush of multiple column families.
- Fixed some cases of in-memory data corruption using mmap reads with BackupEngine, sst_dump, or ldb.
- Fixed issues with experimental preclude_last_level_data_seconds option that could interfere with expected data tiering.
- Fixed the handling of the edge case when all existing blob files become unreferenced. Such files are now correctly deleted.

The full release notes are available at [rocksdbjni releases](https://github.com/facebook/rocksdb/releases).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2389 from SteNicholas/CELEBORN-1330.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-14 18:01:03 +08:00
tiny-dust
22c70124e9 [CELEBORN-1307][FOLLOWUP] Introduce worker detail module for dashboard frontend
### What changes were proposed in this pull request?

Introduce worker detail module for dashboard frontend:

To detail page:
![output](https://github.com/apache/incubator-celeborn/assets/49502875/64f158c7-d791-4427-81c8-2ca767f4778f)

Storages:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/ee83256c-b925-479a-aa96-3252d5808c3a)

Memory:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/9d33095f-9e6c-4b41-af51-06c840c56146)

Application:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/54f786f6-433c-402f-aa80-da9c13710322)

Configuration:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/2ece1933-45f7-4de6-b483-471c57aae249)

FlameGraph:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/259be92a-46a8-4052-bf6b-7ec0c5ede92c)

ThreadDump:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/e7260fbe-5e95-4718-9ba3-15ec5e61f0f7)

Metrics:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/2d117ff0-621f-48c3-90c6-fa33567b67da)

Logs:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/b8fc7c30-614f-4dec-8370-a058c7ea1c66)

LogList:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/96e2f9a4-ffbb-428b-885b-bd163c99f6a8)

### Why are the changes needed?

Show more `worker` information.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local test.

Closes #2393 from tiny-dust/worker-detail.

Authored-by: tiny-dust <idioticzhou@foxmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-14 17:35:36 +08:00
SteNicholas
d33ac28945 [MINOR] Fix typo of celeborn.network.bind.preferIpAddress doc
### What changes were proposed in this pull request?

Fix typo of `celeborn.network.bind.preferIpAddress` doc from `ture` to `true`.

### Why are the changes needed?

`celeborn.network.bind.preferIpAddress` doc has typo for `ture`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2392 from SteNicholas/prefer-ip-address.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-14 17:12:33 +08:00
mingji
5043f46aa7 [CELEBORN-863][FOLLOWUP] Fix persisted committed file infos lost
### What changes were proposed in this pull request?
Fix a bug that might cause persisted committed file info to be lost.

### Why are the changes needed?
When a worker starts, it cleans its persisted committed file info and does not put it back. If the worker then restarts again, the committed file info is lost.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA.

Closes #2390 from FMX/b863-1.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-03-14 13:51:10 +08:00
Curtis Howard
23269d77e9
[CELEBORN-1327] Support Spark 3.5 with JDK21
### What changes were proposed in this pull request?
Updates Celeborn to account for [JDK-8303083](https://bugs.openjdk.org/browse/JDK-8303083), which affects JDK21 (similar to the Apache Spark change in https://github.com/apache/spark/pull/39909).

### Why are the changes needed?
Without this change, Spark users of Celeborn will encounter a runtime error similar to the following:
`Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.IllegalStateException: java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long, int) [in thread "Executor task launch worker for task 0.0 in stage 0.0 (TID 0)"]
        at org.apache.celeborn.common.unsafe.Platform.<clinit>(Platform.java:135)
        ... 16 more`
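One way to tolerate this difference is to look up the constructor reflectively and fall back to the other signature. The sketch below is an assumption-laden illustration, not the patch itself: `SketchDirectBufferCtor` is a hypothetical class, and the `(long, long)` fallback signature is assumed from the referenced Spark change. It only locates the constructor; actually invoking it would additionally require `setAccessible(true)` and suitable `--add-opens` flags.

```java
import java.lang.reflect.Constructor;

// Hypothetical helper: find the private DirectByteBuffer constructor on old and new JDKs.
class SketchDirectBufferCtor {
  static Constructor<?> find() {
    try {
      Class<?> cls = Class.forName("java.nio.DirectByteBuffer");
      try {
        // Signature named in the error message above (missing on JDK21).
        return cls.getDeclaredConstructor(long.class, int.class);
      } catch (NoSuchMethodException e) {
        // Assumed replacement signature on JDK21 and later.
        return cls.getDeclaredConstructor(long.class, long.class);
      }
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(find());
  }
}
```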

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested using standalone Spark 3.5.1 (Hadoop 3.2) and the Celeborn `main` branch with JDK21 build changes in https://github.com/apache/incubator-celeborn/pull/2385.  Reproduced the runtime error above and confirmed the patch resolves it.

Closes #2387 from curtishoward/CELEBORN-1327.

Authored-by: Curtis Howard <curtis@curtiss-mbp.lan>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2024-03-13 23:08:09 +08:00
tiny-dust
6640e11557 [CELEBORN-1307] Introduce worker module for dashboard frontend
Introduce worker module for dashboard frontend:
![image](https://github.com/apache/incubator-celeborn/assets/49502875/92c76f50-b0a2-4050-af50-2a30c7eea94a)

The worker page provides Celeborn with worker lists and overview information.

No.

Local test.

Closes #2382 from tiny-dust/CELEBORN-1307.

Authored-by: tiny-dust <idioticzhou@foxmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-13 20:32:41 +08:00
CodingCat
2e251457f3 [CELEBORN-1309] Support adaptive management of memory threshold for SortBasedWriter
### What changes were proposed in this pull request?

While `SortBasedWriter` has a smaller memory footprint than `HashBasedWriter`, it suffers from a performance issue when there are many partitions and the write buffer is quickly filled with small chunks of data.

For example, suppose the sort buffer size is 32K and there are 4 partitions (A, B, C, D) with 128K of data in total, arriving 8K per partition at a time. In this case each partition's data is compressed and sent as a small 8K chunk 4 times, and the cost becomes very high. `HashBasedWriter` does not have this problem because a push only happens when the per-partition buffer is full. A larger sort buffer size can mitigate the issue, but tuning the sort buffer size per job is tedious work.

This PR introduces a new feature that measures the total bytes and count of pushed data, as well as the "should-push" bytes and counts ("should-push" means that the pushed data is larger than CLIENT_PUSH_BUFFER_MAX_SIZE; in other words, `HashBasedWriter` would also trigger a push in this case).

When actualPushedBytes/actualPushedCounts > (1 + Threshold) * (ShouldPushBytes/ShouldPushCounts), the sort buffer size is enlarged by 1X (that is, doubled) to try to buffer more data before pushing. The maximum sort buffer size is capped at # of partitions * CLIENT_PUSH_BUFFER_MAX_SIZE.
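The enlargement rule can be sketched as a small pure function. This takes the condition exactly as written above and reads "enlarge by 1X" as doubling; `SketchAdaptiveSortBuffer` and all parameter names are hypothetical and do not come from the patch.

```java
// Hypothetical sketch of the adaptive rule; not the code from the PR.
class SketchAdaptiveSortBuffer {
  static long nextBufferSize(
      long currentSize,
      long actualPushedBytes,
      long actualPushedCount,
      long shouldPushBytes,
      long shouldPushCount,
      double threshold,
      int numPartitions,
      long pushBufferMaxSize) {
    // Without samples on either side there is nothing to compare yet.
    if (actualPushedCount == 0 || shouldPushCount == 0) {
      return currentSize;
    }
    double actualAvg = (double) actualPushedBytes / actualPushedCount;
    double shouldAvg = (double) shouldPushBytes / shouldPushCount;
    // Upper bound: number of partitions * CLIENT_PUSH_BUFFER_MAX_SIZE.
    long cap = (long) numPartitions * pushBufferMaxSize;
    if (actualAvg > (1 + threshold) * shouldAvg) {
      // Double the buffer, but never beyond the cap.
      return Math.min(currentSize * 2, cap);
    }
    return currentSize;
  }
}
```

With a 32768-byte buffer, 4 partitions, and a 65536-byte push buffer limit, the cap is 262144 bytes, so the buffer can double at most three times before it stops growing.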

### Why are the changes needed?

To reduce the performance cost of the sort-based writer.

### Does this PR introduce _any_ user-facing change?

No, but it adds 2 extra configurations.

### How was this patch tested?

In production at our company, and with unit tests.

Closes #2358 from CodingCat/adaptive_memory_threshold.

Authored-by: CodingCat <zhunansjtu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-13 13:54:12 +08:00
Aravind Patnam
ca64bb5143 [CELEBORN-1313] Custom Network Location Aware Replication
### What changes were proposed in this pull request?

Enable custom network location aware replication, based on a custom impl of `DNSToSwitchMapping`.

### Why are the changes needed?

Resolving the network locations of multiple workers at the master can be expensive at times. With this change, each worker resolves its own network location and sends it to the master via the RegisterWorker transport message. If the worker cannot resolve its location, the master falls back to attempting the resolution itself (during update meta or reload of snapshot). Proposal: [Celeborn Custom Network Location Aware Replication](https://docs.google.com/document/d/11M_MKKnIXCTExJHMX-OMTq7SBpkl8fJMlpy8hLgmev0/edit#heading=h.s3vnydz589z5)
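The worker-first, master-fallback order can be sketched like this. It is only an illustration under assumptions: `SketchLocationResolver`, the nullable `reportedByWorker` argument, the `masterResolver` function, and the `/default-rack` default are hypothetical and are not taken from the patch or from `DNSToSwitchMapping`.

```java
import java.util.function.Function;

// Hypothetical sketch of the resolution order; not Celeborn's implementation.
class SketchLocationResolver {
  // Assumed placeholder for hosts that nobody can resolve.
  static final String DEFAULT_LOCATION = "/default-rack";

  static String resolve(
      String host, String reportedByWorker, Function<String, String> masterResolver) {
    // Prefer the location the worker resolved and sent when registering.
    if (reportedByWorker != null && !reportedByWorker.isEmpty()) {
      return reportedByWorker;
    }
    // Otherwise fall back to resolving at the master.
    String fromMaster = masterResolver.apply(host);
    return fromMaster != null ? fromMaster : DEFAULT_LOCATION;
  }
}
```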

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Updated the unit tests.

Closes #2367 from akpatnam25/CELEBORN-1313.

Authored-by: Aravind Patnam <apatnam@linkedin.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-13 11:10:30 +08:00
SteNicholas
a03b7b5a66
[CELEBORN-1326] FakedRemoteInputChannel use task name of RemoteShuffleInputGateDelegation as owningTaskName
### What changes were proposed in this pull request?

`FakedRemoteInputChannel` uses the task name of `RemoteShuffleInputGateDelegation` as `owningTaskName`, so that the task name can be obtained from the log of `SingleInputGate`.

### Why are the changes needed?

`FakedRemoteInputChannel` uses an empty string as `owningTaskName`, so the task name cannot be obtained from the log of `SingleInputGate`. It could instead use the task name of `RemoteShuffleInputGateDelegation` as `owningTaskName`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2384 from SteNicholas/CELEBORN-1326.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-12 20:13:28 +08:00
SteNicholas
9612c20bb1
[MINOR] Improve SuiteJ of client-flink module
### What changes were proposed in this pull request?

Improve `SuiteJ` of `client-flink` module including:

- Align the name of test class with `SuiteJ`.
- Improve the test case of `SuiteJ`.

### Why are the changes needed?

`SuiteJ` of `client-flink` module could be improved.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2383 from SteNicholas/client-flink-suitej.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-12 17:51:57 +08:00
SteNicholas
0054930ce7
[CELEBORN-1323] Introduce ShutdownWorkerCount metric to record the count of workers in shutdown list
### What changes were proposed in this pull request?

Introduce `ShutdownWorkerCount` metric to record the count of workers in shutdown list.

<img width="1432" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/bc84b281-30ca-40a1-92e4-fb9cf10b5aeb">

### Why are the changes needed?

`/shutdownWorkers` lists all workers currently in the shutdown list of the master. It is therefore recommended to introduce the `ShutdownWorkerCount` metric to record the count of workers in the shutdown list.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/c44822917403401690edb15617ec9f08)

Closes #2379 from SteNicholas/CELEBORN-1323.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-12 16:01:22 +08:00
MiyueSC
d52f934be0 [CELEBORN-1305] Introduce application module for dashboard frontend
### What changes were proposed in this pull request?

Introduce application module for dashboard frontend:

![image](https://github.com/apache/incubator-celeborn/assets/50617660/fbc2a0cb-4391-4068-bc37-af6c01e9ef75)

### Why are the changes needed?

Dashboard frontend should support application page to display the application list and overview details of Celeborn.

### Does this PR introduce _any_ user-facing change?

no.

### How was this patch tested?

Local test.

Closes #2368 from miyuesc/CELEBORN-1305.

Lead-authored-by: MiyueSC <913784771@qq.com>
Co-authored-by: MiyueFE <913784771@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-12 14:24:01 +08:00
Fei Wang
597f594f93 [CELEBORN-1324][MINOR] Remove unused PrometheusSink class
### What changes were proposed in this pull request?

PrometheusSink is not used.

### Why are the changes needed?

Close CELEBORN-1324

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Not needed.

Closes #2381 from turboFei/remove_unused.

Authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-12 14:07:12 +08:00
zky.zhoukeyong
2bfbab7a47 [CELEBORN-1320] Use ReviveManager for soft splits
### What changes were proposed in this pull request?
Currently SOFT_SPLIT bypasses `ReviveManager` and sends `PartitionSplit` requests to
LifecycleManager individually, which can cause too many messages in `inbox`; see the issue
described in https://github.com/apache/incubator-celeborn/pull/2366

This PR uses `ReviveManager`, i.e. batched RPCs, for `SOFT_SPLIT` events. Before this PR,
the max size of `Inbox#messages` was several hundred in my experiment, where soft splits happen frequently:
```
24/03/11 18:33:05 WARN [rpc-server-4-7] Inbox: last max msg cnt in 1 second: 620
24/03/11 18:33:06 WARN [rpc-server-4-5] Inbox: last max msg cnt in 1 second: 105
24/03/11 18:33:07 WARN [rpc-server-4-14] Inbox: last max msg cnt in 1 second: 94
24/03/11 18:33:08 WARN [rpc-server-4-13] Inbox: last max msg cnt in 1 second: 726
24/03/11 18:33:09 WARN [rpc-server-4-3] Inbox: last max msg cnt in 1 second: 50]
24/03/11 18:33:10 WARN [rpc-server-4-16] Inbox: last max msg cnt in 1 second: 98
24/03/11 18:33:11 WARN [rpc-server-4-12] Inbox: last max msg cnt in 1 second: 83
24/03/11 18:33:12 WARN [rpc-server-4-11] Inbox: last max msg cnt in 1 second: 138
24/03/11 18:33:13 WARN [rpc-server-4-9] Inbox: last max msg cnt in 1 second: 315
24/03/11 18:33:14 WARN [rpc-server-4-4] Inbox: last max msg cnt in 1 second: 787
```

After this PR, the size is reduced by an order of magnitude:
```
24/03/11 18:39:17 WARN [rpc-server-4-5] Inbox: last max msg cnt in 1 second: 30]
24/03/11 18:39:18 WARN [rpc-server-4-12] Inbox: last max msg cnt in 1 second: 1]
24/03/11 18:39:19 WARN [rpc-server-4-19] Inbox: last max msg cnt in 1 second: 1]
24/03/11 18:39:20 WARN [rpc-server-4-15] Inbox: last max msg cnt in 1 second: 1]
24/03/11 18:39:21 WARN [rpc-server-4-3] Inbox: last max msg cnt in 1 second: 10]
24/03/11 18:39:22 WARN [rpc-server-4-20] Inbox: last max msg cnt in 1 second: 1]
24/03/11 18:39:23 WARN [rpc-server-4-12] Inbox: last max msg cnt in 1 second: 1]
24/03/11 18:39:24 WARN [rpc-server-4-24] Inbox: last max msg cnt in 1 second: 1]
24/03/11 18:39:25 WARN [rpc-server-4-9] Inbox: last max msg cnt in 1 second: 10]
24/03/11 18:39:26 WARN [rpc-server-4-13] Inbox: last max msg cnt in 1 second: 1]
24/03/11 18:39:27 WARN [rpc-server-4-2] Inbox: last max msg cnt in 1 second: 10]
24/03/11 18:39:28 WARN [rpc-server-4-2] Inbox: last max msg cnt in 1 second: 80]
```
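The effect of batching on the message count can be sketched with a small queue that is drained periodically. This is a generic illustration, not `ReviveManager` itself: `SketchBatcher`, `submit`, and `flush` are hypothetical names, and the periodic scheduling is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical batching helper; not Celeborn's ReviveManager.
class SketchBatcher<T> {
  private final LinkedBlockingQueue<T> pending = new LinkedBlockingQueue<>();
  private final Consumer<List<T>> sendBatch;

  SketchBatcher(Consumer<List<T>> sendBatch) {
    this.sendBatch = sendBatch;
  }

  // Callers enqueue requests instead of sending one RPC each.
  void submit(T request) {
    pending.add(request);
  }

  // Meant to be called periodically; sends everything queued so far as one batch.
  int flush() {
    List<T> batch = new ArrayList<>();
    pending.drainTo(batch);
    if (!batch.isEmpty()) {
      sendBatch.accept(batch);
    }
    return batch.size();
  }
}
```

If 100 requests are submitted between two flushes, the receiver sees one message carrying 100 requests instead of 100 separate messages.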

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA and manual test.

Closes #2377 from waitinfuture/1320.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-12 11:50:38 +08:00
SteNicholas
dee4afc580
[CELEBORN-1322] Rename LostWorkers metric to LostWorkerCount to align the naming style
### What changes were proposed in this pull request?

Rename `LostWorkers` metric to `LostWorkerCount` to align the naming style of other worker count metrics.

### Why are the changes needed?

The name of the `LostWorkers` metric differs from other `MasterSource` metrics such as `WorkerCount` and `ExcludedWorkerCount`, so it is renamed to `LostWorkerCount` to align the naming style.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2378 from SteNicholas/CELEBORN-1322.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-11 20:41:22 +08:00
Angerszhuuuu
4cd4b6d5ce [CELEBORN-1321] Change noisy expire shuffle log to debug level and aggregate log
### What changes were proposed in this pull request?
Change the noisy expire shuffle log to debug level and aggregate the log.

### Why are the changes needed?
Remove the noisy expire log.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #2376 from AngersZhuuuu/CELEBORN-1321.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2024-03-11 18:54:34 +08:00
Chandni Singh
96df0d6e3c [CELEBORN-1179] Add support in Celeborn Workers to fetch application meta from the Master
### What changes were proposed in this pull request?
This enables a Celeborn Worker to retrieve the application meta from the Master if it hasn't received the secret from the Master before the application attempts to connect to it. Additionally, the Celeborn Worker's SecretRegistry has been converted into an LRU cache to prevent unbounded growth of the registry.
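A bounded LRU registry can be sketched with `LinkedHashMap` in access order. The source does not say how the Worker's SecretRegistry is implemented, so this is a swapped-in standard technique for illustration only; `SketchLruRegistry` and its methods are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded registry; not Celeborn's SecretRegistry.
class SketchLruRegistry {
  private final Map<String, String> secrets;

  SketchLruRegistry(final int maxEntries) {
    // accessOrder = true makes iteration order least-recently-accessed first.
    this.secrets =
        Collections.synchronizedMap(
            new LinkedHashMap<String, String>(16, 0.75f, true) {
              @Override
              protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict the least recently used entry once the bound is exceeded.
                return size() > maxEntries;
              }
            });
  }

  void register(String appId, String secret) {
    secrets.put(appId, secret);
  }

  // Returns null when the secret is absent, e.g. never registered or already evicted.
  String get(String appId) {
    return secrets.get(appId);
  }

  int size() {
    return secrets.size();
  }
}
```

A `null` result from `get` is the point where, per the description above, a worker would fetch the application meta from the Master.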

### Why are the changes needed?
This is the last change needed for Auth support in Celeborn (https://issues.apache.org/jira/browse/CELEBORN-1011)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs and part of a bigger change which will be tested end-to-end.

Closes #2363 from otterc/CELEBORN-1179.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-11 14:15:08 +08:00
SteNicholas
36f2e03138
[MINOR] Fix style and Gluten link in Developers Doc
### What changes were proposed in this pull request?

Fix style and Gluten link in Developers Doc.

### Why are the changes needed?

- `slotsallocation.md` has the following wrong style:

<img width="1434" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/97fb53ed-473d-4f3d-8231-1fb613df9132">

- Gluten is an Apache incubating project, so the link to the Gluten project should be [Gluten](https://github.com/apache/incubator-gluten).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2375 from SteNicholas/developers-doc.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-11 12:07:01 +08:00
SteNicholas
07d342151c [CELEBORN-1284][FOLLOWUP] Fix license style of quota_management.md
### What changes were proposed in this pull request?

Fix license style of `quota_management.md`.

### Why are the changes needed?

The license style of `quota_management.md` is wrong.

<img width="1438" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/4a00724d-5fec-4b25-b134-d814c3152efd">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2374 from SteNicholas/CELEBORN-1284.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-11 11:37:30 +08:00
SteNicholas
7f6cb6735f
[CELEBORN-1316] Override toString method for StoreVersion
### What changes were proposed in this pull request?

Override `toString` method for `StoreVersion`.

### Why are the changes needed?

Avoid displaying `StoreVersion@hashCode` in the `IOException` thrown when the `checkVersion` check fails in `RocksDBProvider`/`LevelDBProvider`, which currently shows something like:
```
cannot read state DB with version org.apache.celeborn.service.deploy.worker.shuffledb.StoreVersion@1f, incompatible with current version org.apache.celeborn.service.deploy.worker.shuffledb.StoreVersion@3e
```
Backport https://github.com/apache/spark/pull/44624.
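A `toString` override of this kind might look like the sketch below. The `major`/`minor` fields and the output format are assumptions for illustration; `SketchStoreVersion` is a hypothetical class, not the real `StoreVersion`.

```java
// Hypothetical version holder; field names and format are assumed.
class SketchStoreVersion {
  final int major;
  final int minor;

  SketchStoreVersion(int major, int minor) {
    this.major = major;
    this.minor = minor;
  }

  // Without this override, string concatenation falls back to Object#toString,
  // which prints the class name followed by '@' and the hex hash code.
  @Override
  public String toString() {
    return "StoreVersion[" + major + "." + minor + "]";
  }
}
```

With the override, an error message built from the object reads `StoreVersion[1.0]` rather than an opaque hash.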

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `DBProviderSuiteJ`

Closes #2372 from SteNicholas/CELEBORN-1316.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-11 09:30:59 +08:00
sychen
b3eed34b57
[CELEBORN-1293] Output received signals at master and worker
### What changes were proposed in this pull request?
When we shut down the master or worker, we can output the signal as a record.

### Why are the changes needed?
Conveniently track the status of master and workers.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
local test

```bash
./sbin/stop-all.sh
```

```
12:20:59.932 [SIGTERM handler] ERROR org.apache.celeborn.service.deploy.master.Master - RECEIVED SIGNAL TERM
```

```
12:20:59.563 [SIGTERM handler] ERROR org.apache.celeborn.service.deploy.worker.Worker - RECEIVED SIGNAL TERM
```

Closes #2334 from cxzl25/CELEBORN-1293.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-08 15:48:57 +08:00
zky.zhoukeyong
2456325ff5 [CELEBORN-1312] Move handleRequestPartitions out of sync block
### What changes were proposed in this pull request?
As title, `handleRequestPartitions` is quite heavy since it calls sync RPC.
It's unnecessary to put it in the sync block.
This fixes the same issue as https://github.com/apache/incubator-celeborn/pull/2207
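The general pattern of moving a heavy call out of a synchronized block can be sketched as follows: take a snapshot of the shared state while holding the lock, then make the slow call after releasing it. `SketchRequestQueue` and its methods are hypothetical and only illustrate the pattern, not the actual `handleRequestPartitions` code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical queue illustrating "snapshot under lock, heavy work outside".
class SketchRequestQueue {
  private final Object lock = new Object();
  private final List<String> pending = new ArrayList<>();

  void add(String request) {
    synchronized (lock) {
      pending.add(request);
    }
  }

  // Exposed only so the example can show the lock is free during the heavy call.
  boolean lockHeldByCurrentThread() {
    return Thread.holdsLock(lock);
  }

  int process(Consumer<List<String>> heavyCall) {
    List<String> snapshot;
    synchronized (lock) {
      // Keep the critical section short: copy and clear only.
      snapshot = new ArrayList<>(pending);
      pending.clear();
    }
    // The slow call (for example a sync RPC) runs without holding the lock.
    if (!snapshot.isEmpty()) {
      heavyCall.accept(snapshot);
    }
    return snapshot.size();
  }
}
```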

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA and manual test.

Closes #2364 from waitinfuture/1312.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-08 15:33:31 +08:00
SteNicholas
19b78dedd9
[CELEBORN-1315] Manually close the RocksDB/LevelDB instance when checkVersion throw Exception
### What changes were proposed in this pull request?

Close the `RocksDB`/`LevelDB` instance when `checkVersion` throws an exception.

Backport [[SPARK-46389][CORE] Manually close the RocksDB/LevelDB instance when checkVersion throw Exception](https://github.com/apache/spark/pull/44327).

### Why are the changes needed?

In the process of initializing the DB in `RocksDBProvider`/`LevelDBProvider`, there is a `checkVersion` step that may throw an exception. After the exception is thrown, the upper-level caller cannot hold the already opened RocksDB/LevelDB instance, so it cannot perform resource cleanup, which poses a potential risk of handle leakage. So this PR manually closes the `RocksDB`/`LevelDB` instance when `checkVersion` throws an exception.
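The cleanup pattern described here can be sketched without any database dependency. The `Db` interface, `SketchDbOpener`, and the integer version check are hypothetical stand-ins, not the `RocksDBProvider`/`LevelDBProvider` code.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical sketch: close the opened handle if the version check fails.
class SketchDbOpener {
  // Stand-in for an opened RocksDB/LevelDB handle.
  interface Db extends Closeable {
    int storedVersion();
  }

  static void checkVersion(Db db, int expectedVersion) throws IOException {
    if (db.storedVersion() != expectedVersion) {
      throw new IOException(
          "cannot read state DB with version " + db.storedVersion()
              + ", incompatible with current version " + expectedVersion);
    }
  }

  static Db init(Db opened, int expectedVersion) throws IOException {
    try {
      checkVersion(opened, expectedVersion);
      return opened;
    } catch (IOException e) {
      // The caller never receives the handle, so it must be released here.
      try {
        opened.close();
      } catch (IOException closeError) {
        e.addSuppressed(closeError);
      }
      throw e;
    }
  }
}
```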

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2369 from SteNicholas/CELEBORN-1315.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-08 15:03:57 +08:00
tiny-dust
e78f63217a [CELEBORN-1302] Introduce overview module for dashboard frontend
### What changes were proposed in this pull request?

Introduce overview module for dashboard frontend:

![image](https://github.com/apache/incubator-celeborn/assets/49502875/61e6f75b-d1a9-4614-a06c-3c701a0f5f8d)

### Why are the changes needed?

Dashboard frontend should support overview page to display the overview details of Celeborn.

### Does this PR introduce _any_ user-facing change?

no.

### How was this patch tested?

Local test.

Closes #2361 from tiny-dust/CELEBORN-1302.

Authored-by: tiny-dust <idioticzhou@foxmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-08 10:02:47 +08:00
SteNicholas
0f60dce791 [CELEBORN-1282][FOLLOWUP] Fix FetchHandler#handleEndStreamFromClient NullPointerException after recycling stream of CreditStreamManager
### What changes were proposed in this pull request?

Fix the `NullPointerException` in `FetchHandler#handleEndStreamFromClient` that occurs after recycling a stream removes the corresponding `streamId` from `streams`.

### Why are the changes needed?

`FetchHandler#handleEndStreamFromClient` needs the shuffle key to record the application's active connection. After a stream is recycled, however, it may throw a `NullPointerException`, because `recycleStream` may be invoked in `MapPartitionDataReader`, which removes the corresponding `streamId` before `FetchHandler#handleEndStreamFromClient` runs.
```
24/03/05 13:27:14,522 DEBUG [worker-credit-stream-manager-recycler] MapDataPartition: release all for stream: 990159671000
24/03/05 13:27:14,524 DEBUG [worker-credit-stream-manager-recycler] MapDataPartition: release map data partition FileInfo{file=/data00/home/guoyangze/data/celeborn-worker/shuffle_data/1709616425086-343fe33c97559405b474412efc0d9ce5/0/0-0-0, chunkOffsets=0, userIdentifier=`default`.`default`, partitionType=MAP}
24/03/05 13:27:14,531 ERROR [fetch-server-11-1] TransportRequestHandler: Error while invoking handler#receive() on RPC id 18
java.lang.NullPointerException
        at org.apache.celeborn.service.deploy.worker.storage.CreditStreamManager.getStreamShuffleKey(CreditStreamManager.java:189)
        at org.apache.celeborn.service.deploy.worker.FetchHandler.handleEndStreamFromClient(FetchHandler.scala:369)
        at org.apache.celeborn.service.deploy.worker.FetchHandler.handleRpcRequest(FetchHandler.scala:143)
        at org.apache.celeborn.service.deploy.worker.FetchHandler.receive(FetchHandler.scala:97)
        at org.apache.celeborn.common.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:96)
        at org.apache.celeborn.common.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:84)
        at org.apache.celeborn.common.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:156)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at org.apache.celeborn.common.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:74)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
```
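The race above can be illustrated with a minimal sketch. The class below is hypothetical and is not Celeborn's `CreditStreamManager`; it only assumes that per-stream state lives in a concurrent map keyed by `streamId`, and shows a lookup that tolerates an entry already removed by a concurrent recycle.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a stream registry; all names are illustrative only.
class StreamRegistrySketch {
  private final ConcurrentHashMap<Long, String> shuffleKeys = new ConcurrentHashMap<>();

  void register(long streamId, String shuffleKey) {
    shuffleKeys.put(streamId, shuffleKey);
  }

  // May run on a recycler thread before the end-of-stream message is handled.
  void recycle(long streamId) {
    shuffleKeys.remove(streamId);
  }

  // Returns empty instead of dereferencing a missing entry.
  Optional<String> shuffleKeyOf(long streamId) {
    return Optional.ofNullable(shuffleKeys.get(streamId));
  }

  // Returns true only if the stream was still registered, so the caller
  // can skip connection bookkeeping for streams that were already recycled.
  boolean handleEndStream(long streamId) {
    Optional<String> key = shuffleKeyOf(streamId);
    if (!key.isPresent()) {
      return false; // stream already recycled; nothing to record
    }
    recycle(streamId);
    return true;
  }
}
```

Whether the actual fix guards the lookup this way or reorders the calls is not stated in this message; the sketch only shows one way to make the lookup safe.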

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2356 from SteNicholas/CELEBORN-1282.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-07 17:59:15 +08:00
Chandni Singh
835437f0b9 [CELEBORN-1261] Add auth support to client
### What changes were proposed in this pull request?
This enables the client to push shuffle data to, and fetch it from, Celeborn Workers securely.

### Why are the changes needed?
This change is required for adding authentication. ([CELEBORN-1011](https://issues.apache.org/jira/browse/CELEBORN-1011)).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
It is part of a bigger change, which will be tested end to end.

Closes #2360 from otterc/CELEBORN-1261.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-07 09:54:28 +08:00
Chandni Singh
6897b8be99 [CELEBORN-1234] Master should persist the application meta in Ratis and push it to the Workers
### What changes were proposed in this pull request?
This enables Celeborn Master to persist application meta in Ratis and also push it to Celeborn Workers when it receives the requests for slots from the LifecycleManager.

### Why are the changes needed?
This change is required for adding authentication. ([CELEBORN-1011](https://issues.apache.org/jira/browse/CELEBORN-1011)).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added some UTs.

Closes #2346 from otterc/CELEBORN-1234.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-06 17:02:04 +08:00
SteNicholas
dbca7536b6 [CELEBORN-1311] Developers Doc introduce Slots allocation
### What changes were proposed in this pull request?

Introduce `Slots allocation` in the `Developers Doc`.

### Why are the changes needed?

The navigation in `mkdocs.yml` does not include a link to `slotsallocation.md`, so the link should be added.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2359 from SteNicholas/CELEBORN-1311.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-06 16:44:25 +08:00
sychen
2d59c46d8f [CELEBORN-1298][FOLLOWUP] Support Spark2.4 with Scala2.12
### What changes were proposed in this pull request?

### Why are the changes needed?
The sbt build is not tested in the Spark 2.4 with Scala 2.12 environment.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2357 from cxzl25/CELEBORN-1298-F.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shaoyun Chen <csy@apache.org>
2024-03-06 11:13:08 +08:00
zky.zhoukeyong
8b6bc35997 [CELEBORN-1300] Optimize CelebornInputStreamImpl's memory usage
### What changes were proposed in this pull request?
Avoid excessive memory usage when `CelebornShuffleReader` creates input streams.
This PR does the following:

1. The constructor of `CelebornInputStream` no longer fetches a chunk
2. `compressedBuf` and `rawDataBuf` are created the first time `fillBuffer` is called
3. When `fillBuffer` returns false, which means the input stream is exhausted, `close` is called and resources are released
4. `CelebornFetchFailureSuite` is only run for Spark 3.0 and newer
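
Items 1 to 3 follow a lazy-allocation pattern that can be sketched as below. The class is hypothetical and is not `CelebornInputStream`; the in-memory chunk list, the single buffer, and the accessor names are assumptions made only for illustration.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stream; illustrates lazy buffer allocation and release on exhaustion.
class LazyChunkStreamSketch {
  private final Iterator<byte[]> chunks;
  private final int bufferSize;
  private byte[] buffer; // allocated on the first fillBuffer call, not in the constructor
  private int limit;
  private boolean closed;

  LazyChunkStreamSketch(List<byte[]> source, int bufferSize) {
    // The constructor neither fetches a chunk nor allocates a buffer.
    this.chunks = source.iterator();
    this.bufferSize = bufferSize;
  }

  // Returns false once the stream is exhausted, releasing resources at that point.
  boolean fillBuffer() {
    if (closed) {
      return false;
    }
    if (!chunks.hasNext()) {
      close();
      return false;
    }
    if (buffer == null) {
      buffer = new byte[bufferSize];
    }
    byte[] chunk = chunks.next();
    // Simplification: a chunk larger than the buffer is truncated.
    limit = Math.min(chunk.length, buffer.length);
    System.arraycopy(chunk, 0, buffer, 0, limit);
    return true;
  }

  void close() {
    buffer = null; // drop the reference so the memory can be reclaimed
    limit = 0;
    closed = true;
  }

  boolean isBufferAllocated() { return buffer != null; }
  boolean isClosed() { return closed; }
  int bufferedBytes() { return limit; }
}
```

With many streams created up front, only the streams that are actually being read hold a buffer at any given time.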

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA and e2e test.

Closes #2348 from waitinfuture/1300.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-05 14:03:11 +08:00
labbomb
0285021392 [CELEBORN-1303] Introduce API request module for dashboard frontend
### What changes were proposed in this pull request?

- Adds api request module.
- Adds the license for `varlet/axle`.
- Adds pagination function.

### Why are the changes needed?

API request module is required to request the backend interface for dashboard frontend.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2353 from labbomb/CELEBORN-1303.

Authored-by: labbomb <labbomb@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-03-05 12:39:21 +08:00
SteNicholas
ce5386397d [CELEBORN-1134][FOLLOWUP] Add execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING to Flink Configuration of Deploy Flink client
### What changes were proposed in this pull request?

Add `execution.batch-shuffle-mode: ALL_EXCHANGES_BLOCKING` to `Flink Configuration` of `Deploy Flink client` in `deploy.md`

### Why are the changes needed?

#2106 added validation that `execution.batch-shuffle-mode` is `ALL_EXCHANGES_BLOCKING`. The `Flink Configuration` of `Deploy Flink client` should also list this configuration.
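
A check of this kind could look roughly like the sketch below. It is not the validation added in #2106; the class name, the plain `Map` configuration, and the exception type are assumptions, and only the key and value come from this message.

```java
import java.util.Map;

// Hypothetical check; not the validation code from #2106.
class BatchShuffleModeCheckSketch {
  static final String KEY = "execution.batch-shuffle-mode";
  static final String REQUIRED = "ALL_EXCHANGES_BLOCKING";

  // Throws if the key is missing or set to any other mode.
  static void validate(Map<String, String> flinkConf) {
    String value = flinkConf.get(KEY);
    if (!REQUIRED.equals(value)) {
      throw new IllegalArgumentException(KEY + " must be " + REQUIRED + ", but was " + value);
    }
  }
}
```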

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2355 from SteNicholas/CELEBORN-1134.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-04 15:57:40 +08:00
zky.zhoukeyong
cae4de1cc1 [CELEBORN-1301] Catch and throw FetchFailedException in CelebornInputStream#fillBuffer
### What changes were proposed in this pull request?
Catch and throw `FetchFailedException` in `CelebornInputStream#fillBuffer` to enable Spark's stage rerun when `fillBuffer` encounters a fetch chunk exception.
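
The catch-and-rethrow pattern can be sketched as follows. Everything here is hypothetical: `FetchFailedSketchException` stands in for the engine-specific fetch failure type, and `ChunkFetcher` is an invented interface, so this is not the code in `CelebornInputStream`.

```java
import java.io.IOException;

class FetchFailureSketch {
  // Hypothetical exception; stands in for the engine-specific fetch failure type.
  static class FetchFailedSketchException extends RuntimeException {
    final int shuffleId;

    FetchFailedSketchException(int shuffleId, Throwable cause) {
      super("fetch failed for shuffle " + shuffleId, cause);
      this.shuffleId = shuffleId;
    }
  }

  // Hypothetical source of chunks that may fail with an I/O error.
  interface ChunkFetcher {
    byte[] fetch() throws IOException;
  }

  // Translates a low-level fetch error into a fetch failure that carries the
  // shuffle id, so a scheduler can decide to rerun the producing stage.
  static byte[] fillBuffer(int shuffleId, ChunkFetcher fetcher) {
    try {
      return fetcher.fetch();
    } catch (IOException e) {
      throw new FetchFailedSketchException(shuffleId, e);
    }
  }
}
```

Keeping the original exception as the cause preserves the low-level failure for debugging.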

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #2349 from waitinfuture/1301.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2024-03-03 13:32:40 +08:00
SteNicholas
93d2b9f47f [CELEBORN-1298][FOLLOWUP] Support Spark2.4 with Scala2.12
### What changes were proposed in this pull request?

Support Spark2.4 with Scala2.12 in `sbt.md`. Meanwhile, the CI workflow adds the test for Spark2.4 and Scala2.12.

Follow up #2344.

### Why are the changes needed?

Spark2.4 with Scala2.12 is supported.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2345 from SteNicholas/CELEBORN-1298.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-29 21:48:51 +08:00
Keyong Zhou
af8c159c9f [CELEBORN-1298] Support Spark2.4 with Scala2.12
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
As title

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
GA

Closes #2344 from waitinfuture/1298-1.

Lead-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
2024-02-29 17:58:54 +08:00
Erik.fang
6d9fbf5e49 [CELEBORN-1271] Fix unregisterShuffle with celeborn.client.spark.fetch.throwsFetchFailure disabled
### What changes were proposed in this pull request?
per https://issues.apache.org/jira/browse/CELEBORN-1271
fix the bug with SparkShuffleManager.unregisterShuffle when celeborn.client.spark.fetch.throwsFetchFailure=false

### Why are the changes needed?
The bug prevents shuffle data from being cleaned up by `unregisterShuffle`.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Manually tested.

Closes #2305 from ErikFang/CELEBORN-1271-fix-unregisterShuffle.

Authored-by: Erik.fang <fmerik@gmail.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
2024-02-29 16:17:54 +08:00
Chandni Singh
d5a1bcdb6d [CELEBORN-1256] Added internal port and auth support to Celeborn worker
### What changes were proposed in this pull request?

This adds an internal port and auth support to Celeborn Workers.
1. Internal port is used by a worker to receive messages from Celeborn Master.
2. Authentication support for secure communication with clients. This change doesn't add the support in clients to communicate to the Workers securely. That will be in a future change.

This change targets just adding the port and auth support to Worker. The following items from the proposal are still pending:

- Persisting the app secrets in Ratis.
- Forwarding secrets to Workers and having ability for the workers to pull registration info from the Master.
- Secured communication between workers and clients.

### Why are the changes needed?
It is needed for adding authentication support to Celeborn ([CELEBORN-1011](https://issues.apache.org/jira/browse/CELEBORN-1011))

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Part of a bigger change. For this change, only modified existing UTs.

Closes #2292 from otterc/CELEBORN-1256.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>
2024-02-29 10:09:22 +08:00
Angerszhuuuu
3f884660b4 [CELEBORN-1292] Remove app level metrics from worker and master
### What changes were proposed in this pull request?
On the Master side, calculating sub-resource consumption occupies CPU, which causes RPC timeouts and missing Prometheus metrics.
<img width="1781" alt="Screenshot 2024-02-28 12 04 19" src="https://github.com/apache/incubator-celeborn/assets/46485123/ba49a4ac-ec49-4234-8758-c0db9242abf6">

On the Worker side, too much metrics data is generated.

### Why are the changes needed?
Fix performance issue

### Does this PR introduce _any_ user-facing change?
Remove app level metrics

### How was this patch tested?

Closes #2342 from AngersZhuuuu/CELEBORN-1292.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>
2024-02-28 17:07:51 +08:00
Angerszhuuuu
8391db125b [CELEBORN-1284][DOC] Add document about QuotaManager based on ConfigService
### What changes were proposed in this pull request?
Add document about QuotaManager based on ConfigService

### Why are the changes needed?
Add document about QuotaManager based on ConfigService

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #2325 from AngersZhuuuu/CELEBORN-1284.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-02-28 16:43:29 +08:00
mingji
cd78ebbf20 [CELEBORN-1295][FOLLOWUP] Change repo_url to apache repo
### What changes were proposed in this pull request?
Remove GitHub repo reference link.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2343 from FMX/B1295-1.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-02-28 15:31:18 +08:00
mingji
eed4f924b2 [CELEBORN-1295] Add tm to Celeborn's website
### What changes were proposed in this pull request?
Add trademark symbol.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2338 from FMX/B1295.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-02-28 14:00:15 +08:00
Angerszhuuuu
fe5d968f43 [CELEBORN-1174][FOLLOWUP] Master computeResourceConsumption miss applicationId
### What changes were proposed in this pull request?
Master `computeResourceConsumption` misses `applicationId`.

### Why are the changes needed?
Master `computeResourceConsumption` misses `applicationId`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #2341 from AngersZhuuuu/CELEBORN-1174-FOLLOW.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-02-28 11:57:41 +08:00