Commit Graph

318 Commits

Author SHA1 Message Date
Wang, Fei
d038dd2b32 [CELEBORN-1258] Support to register application info with user identifier and extra info
### What changes were proposed in this pull request?
Support to register application info with user identifier and extra info.

### Why are the changes needed?
To provide more insight for the application information.

### Does this PR introduce _any_ user-facing change?
A new RESTful API.

### How was this patch tested?
UT.

Closes #3428 from turboFei/app_info_uid.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-09-01 11:15:40 +08:00
dz
2817f7fb9e [CELEBORN-2104] Clean up sources of NettyRpcEnv, Master and Worker to avoid thread leaks
### What changes were proposed in this pull request?

Clean up the `Source` instances of NettyRpcEnv, Master and Worker to avoid thread leaks.

### Why are the changes needed?

1. NettyRpcEnv should clean up rpcSource to prevent a resource leak.
2. Master should clean up resourceConsumptionSource, masterSource, threadPoolSource, jVMSource, jVMCPUSource and systemMiscSource.
3. Worker should clean up workerSource.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

local

Closes #3418 from xy2953396112/CELEBORN-2104.

Authored-by: dz <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-29 19:04:19 +08:00
gaoyajun02
1be3094fb2 [CELEBORN-2132] Enhance ratis peer add operation to support clientAddress & adminAddress
### What changes were proposed in this pull request?

This PR enhances the Ratis peer add operation in the RESTful API to support the clientAddress and adminAddress parameters, so that these critical RPC endpoints can be configured properly when adding new peers to the Celeborn master cluster.

### Why are the changes needed?

Currently, when expanding the Celeborn master cluster using the ratis peer add operation, newly added peers lack clientAddress and adminAddress settings. If a newly added peer becomes the Leader, all Followers will return empty addresses to clients, causing them to attempt connections to an incorrect Leader address (127.0.0.1:0). This change ensures proper client request routing in expanded clusters.

### Does this PR introduce _any_ user-facing change?

Yes, this PR extends the API for adding Ratis peers by adding support for clientAddress and adminAddress parameters. Users will now be able to specify these addresses when adding new peers to the cluster.

### How was this patch tested?

Manual testing of cluster expansion scenarios to ensure clients can correctly connect to the Leader regardless of which peer holds leadership.

```
➜ curl -sX  POST zw06-data-k8s-sparktest-node007.mt:9098/api/v1/ratis/peer/add \
  -H "Content-Type: application/json" \
  -d '{ "peers": [{"id": "2", "address": "zw06-data-k8s-sparktest-node009.mt:9872", "clientAddress": "zw06-data-k8s-sparktest-node009.mt:9097", "adminAddress":  "zw06-data-k8s-sparktest-node009.mt:9097" }] }' | jq

{
  "success": true,
  "message": "Successfully added peers ArrayBuffer(2|zw06-data-k8s-sparktest-node009.mt:9872) to group GroupInfoReply:client-3E7C9CE679B2->0group-47BEDE733167, cid=1031, SUCCESS, logIndex=0, commits[0:c224, 1:c224]."
}

➜ curl -s zw06-data-k8s-sparktest-node009.mt:9098/masterGroupInfo
====================== Master Group INFO ==============================
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 0(zw06-data-k8s-sparktest-node007.mt:9872)

[server {
  id: "2"
  address: "zw06-data-k8s-sparktest-node009.mt:9872"
  clientAddress: "zw06-data-k8s-sparktest-node009.mt:9097"
  adminAddress: "zw06-data-k8s-sparktest-node009.mt:9097"
  startupRole: FOLLOWER
}
commitIndex: 228
, server {
  id: "0"
  address: "zw06-data-k8s-sparktest-node007.mt:9872"
  clientAddress: "zw06-data-k8s-sparktest-node007.mt:9097"
  adminAddress: "zw06-data-k8s-sparktest-node007.mt:9097"
  startupRole: FOLLOWER
}
commitIndex: 228
, server {
  id: "1"
  address: "zw06-data-k8s-sparktest-node008.mt:9872"
  clientAddress: "zw06-data-k8s-sparktest-node008.mt:9097"
  adminAddress: "zw06-data-k8s-sparktest-node008.mt:9097"
  startupRole: FOLLOWER
}
commitIndex: 228
]

```

Closes #3452 from gaoyajun02/ratis.

Authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-08-28 11:29:51 +08:00
yuanzhen
8effb735f7 [CELEBORN-2066] Release workers only with high workload when the number of excluded worker set is too large
### What changes were proposed in this pull request?

Provide a user option to release only the workers excluded for high workload when the excluded worker set grows too large.

### Why are the changes needed?

In some cases a large percentage of workers are excluded, most of them because of high workload. It is better to release such workers from the excluded set, because system availability takes priority.

### Does this PR introduce _any_ user-facing change?

New Configuration Option.

### How was this patch tested?
Unit tests.

Closes #3365 from Kalvin2077/exclude-high-stress-workers.

Lead-authored-by: yuanzhen <yuanzhen.hwk@alibaba-inc.com>
Co-authored-by: Kalvin2077 <wk.huang2077@outlook.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-08-22 10:14:38 +08:00
zhaohehuhu
a498b1137f [CELEBORN-1984] Merge ResourceRequest to transportMessageProtobuf
### What changes were proposed in this pull request?
as title

### Why are the changes needed?

Merge Resource.proto into TransportMessages.proto as per the below design

https://cwiki.apache.org/confluence/display/CELEBORN/CIP-16+Merge+transport+proto+and+resource+proto+files

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #3231 from zhaohehuhu/dev-0425.

Lead-authored-by: zhaohehuhu <luoyedeyi@163.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-08-01 23:28:32 +08:00
Kalvin2077
c6e68fddfa [CELEBORN-2053] Refactor remote storage configration usage
### What changes were proposed in this pull request?

Refactoring similar code about configuration usage.

### Why are the changes needed?

Improve scalability for possible new remote storage in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #3353 from Kalvin2077/draft.

Authored-by: Kalvin2077 <wk.huang2077@outlook.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-07-28 16:56:32 +08:00
sychen
df0def6701 [CELEBORN-2082] Add the log of excluded workers with high workloads
### What changes were proposed in this pull request?

### Why are the changes needed?
When workers with higher workloads are excluded, the master does not have a clear log.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #3391 from cxzl25/CELEBORN-2082.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-07-25 20:56:18 +08:00
Aravind Patnam
765265a87d [CELEBORN-2031] Interruption Aware Slot Selection
### What changes were proposed in this pull request?
This PR is part of [CIP17: Interruption Aware Slot Selection](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-17%3A+Interruption+Aware+Slot+Selection).

It changes the slot selection logic to prioritize workers that are not due for an interruption "soon". See more context about the slot selection logic [here](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=362056201#CIP17:InterruptionAwareSlotSelection-SlotsAllocator).
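
The prioritization described above can be sketched roughly as follows. This is only an illustration and not the `SlotsAllocator` code from the PR: the `Worker` class, the `nextInterruptionMs` field and the `soonWindowMs` parameter are hypothetical names, and the real ordering rules are defined by CIP-17.

```java
import java.util.ArrayList;
import java.util.List;

public class InterruptionAwareOrdering {
    /** Hypothetical worker view: host plus the next known interruption time. */
    static final class Worker {
        final String host;
        final long nextInterruptionMs; // Long.MAX_VALUE means no interruption is known

        Worker(String host, long nextInterruptionMs) {
            this.host = host;
            this.nextInterruptionMs = nextInterruptionMs;
        }
    }

    /** Puts workers with no interruption inside the window first, keeping input order within each group. */
    static List<Worker> prioritize(List<Worker> workers, long nowMs, long soonWindowMs) {
        List<Worker> preferred = new ArrayList<>();
        List<Worker> deprioritized = new ArrayList<>();
        for (Worker w : workers) {
            if (w.nextInterruptionMs < nowMs + soonWindowMs) {
                deprioritized.add(w); // interruption expected "soon"
            } else {
                preferred.add(w);
            }
        }
        preferred.addAll(deprioritized);
        return preferred;
    }

    public static void main(String[] args) {
        List<Worker> workers = new ArrayList<>();
        workers.add(new Worker("a", 1_000L));
        workers.add(new Worker("b", Long.MAX_VALUE));
        workers.add(new Worker("c", 100_000L));
        for (Worker w : prioritize(workers, 0L, 5_000L)) {
            System.out.println(w.host);
        }
    }
}
```

With a 5-second window, worker `a` is moved behind `b` and `c`, so slots would be offered to it last.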

### Why are the changes needed?
see [CIP17: Interruption Aware Slot Selection](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-17%3A+Interruption+Aware+Slot+Selection).

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Unit tests. This has also been running in production in our cluster for the last 4-5 months.

Closes #3347 from akpatnam25/CELEBORN-2031-impl.

Authored-by: Aravind Patnam <akpatnam25@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-07-15 17:33:00 +08:00
codenohup
0fa600ade1 [CELEBORN-2055] Fix some typos
### What changes were proposed in this pull request?
Inspired by [FLINK-38038](https://issues.apache.org/jira/projects/FLINK/issues/FLINK-38038?filter=allissues), I used [Tongyi Lingma](https://lingma.aliyun.com/) and the qwen3-thinking LLM to identify and fix some typos in the Celeborn codebase. For example:
- backLog → backlog
- won`t → won't
- can to be read → can be read
- mapDataPartition → mapPartitionData
- UserDefinePasswordAuthenticationProviderImpl → UserDefinedPasswordAuthenticationProviderImpl

### Why are the changes needed?
Remove typos to improve source code readability for users and ease development for developers.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Code and documentation cleanup does not require additional testing.

Closes #3356 from codenohup/fix-typo.

Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-07-10 12:01:02 +08:00
Xianming Lei
03f97e6166 [CELEBORN-1577][FOLLOWUP] Improve check quota message
### What changes were proposed in this pull request?
Improve check quota message.

### Why are the changes needed?
Make check quota message clearer.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #3328 from leixm/follow_CELEBORN-1577.

Authored-by: Xianming Lei <xianming.lei@shopee.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-12 11:01:18 -07:00
Aravind Patnam
ebfa1d8cf4 [CELEBORN-2014] updateInterruptionNotice REST API
### What changes were proposed in this pull request?
This PR is part of [CIP17: Interruption Aware Slot Selection](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-17%3A+Interruption+Aware+Slot+Selection).
It introduces a REST API for external services to notify the master about interruptions/schedules.

### Why are the changes needed?
To notify the master of upcoming interruption notices in the worker fleet. The master can then use these to proactively deprioritize workers that might be in scope for interruption sooner.

### Does this PR introduce _any_ user-facing change?
new rest api

### How was this patch tested?
added unit tests.

Closes #3285 from akpatnam25/CELEBORN-2014.

Authored-by: Aravind Patnam <akpatnam25@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-06-06 14:07:49 +08:00
Sanskar Modi
aceee64c73 [CELEBORN-2018] Support min number of workers selected for shuffle
### What changes were proposed in this pull request?
Support a minimum number of workers on which to assign slots for a shuffle.

### Why are the changes needed?

PR https://github.com/apache/celeborn/pull/3039 updated the default value of `celeborn.master.slot.assign.extraSlots` to avoid skew in shuffle stages with a small number of reducers. However, it also affects stages with a large number of reducers, which is not ideal.

We are introducing a new config `celeborn.master.slot.assign.minWorkers`, which ensures that shuffle stages with a small number of reducers do not cause load imbalance on a few nodes.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
NA

Closes #3297 from s0nskar/min_workers.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-01 08:23:53 -07:00
Shuang
0227a1ab29 [CELEBORN-1627][FOLLOWUP] Fix the issue where the case of name affects the metrics dashboard
### What changes were proposed in this pull request?
Revert role name change in [CELEBORN-1627](https://github.com/apache/celeborn/pull/2777)

### Why are the changes needed?
Fix the issue where the case of name affects the metrics dashboard

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual

Closes #3299 from RexXiong/CELEBORN-1627-FOLLOWUP.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-29 22:40:13 -07:00
Xianming Lei
d2befe0334 [CELEBORN-2008] SlotsAllocator should select disks randomly in RoundRobin mode
### What changes were proposed in this pull request?
SlotsAllocator should select disks randomly in RoundRobin mode

### Why are the changes needed?
The current round robin selection mechanism is to select the first disk of each worker first, then the second disk of each worker, and finally the third disk. This can easily cause disk storage space skew. We should select disks randomly instead of selecting the first disk first.
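
A minimal sketch of the idea, assuming a simplified model in which each worker has a list of disk names. It is not the actual `SlotsAllocator` implementation; the class and method names are hypothetical. Each worker's disk list is shuffled once, so round-robin assignment no longer starts with the same first disk on every worker.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class RandomDiskRoundRobin {
    /** Assigns partition ids round-robin across workers, walking each worker's disks in shuffled order. */
    static Map<Integer, String> assign(
            Map<String, List<String>> disksByWorker, int numPartitions, Random random) {
        if (disksByWorker.isEmpty()) {
            throw new IllegalArgumentException("no workers");
        }
        List<String> workers = new ArrayList<>(disksByWorker.keySet());
        Map<String, List<String>> shuffled = new LinkedHashMap<>();
        for (String worker : workers) {
            List<String> disks = new ArrayList<>(disksByWorker.get(worker));
            if (disks.isEmpty()) {
                throw new IllegalArgumentException("worker without disks: " + worker);
            }
            Collections.shuffle(disks, random); // random starting disk instead of always disk 0
            shuffled.put(worker, disks);
        }
        Map<Integer, String> result = new LinkedHashMap<>();
        for (int p = 0; p < numPartitions; p++) {
            String worker = workers.get(p % workers.size());
            List<String> disks = shuffled.get(worker);
            int round = p / workers.size(); // how many times this worker was already visited
            result.put(p, worker + ":" + disks.get(round % disks.size()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> disks = new LinkedHashMap<>();
        disks.put("worker1", List.of("disk1", "disk2", "disk3"));
        disks.put("worker2", List.of("disk1", "disk2", "disk3"));
        System.out.println(assign(disks, 6, new Random()));
    }
}
```

With two workers, three disks each and six partitions, every disk still receives exactly one partition; only the order in which disks are picked changes.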

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #3275 from leixm/CELEBORN-2008.

Authored-by: Xianming Lei <xianming.lei@shopee.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-22 17:18:42 -07:00
SteNicholas
46d9d63e1f [CELEBORN-1916][FOLLOWUP] Improve Aliyun OSS support
### What changes were proposed in this pull request?

Improve Aliyun OSS support including `SlotsAllocator#offerSlotsLoadAware`, `Worker#heartbeatToMaster` and `PartitionDataWriter#getStorageInfo`.

### Why are the changes needed?

OSS support is missing in several methods, including `SlotsAllocator#offerSlotsLoadAware`, `Worker#heartbeatToMaster` and `PartitionDataWriter#getStorageInfo`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3268 from SteNicholas/CELEBORN-1916.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-21 11:44:50 +08:00
Wang, Fei
90ece9665c [CELEBORN-2002][MASTER] Audit shuffle lifecycle in separate log file
### What changes were proposed in this pull request?
Audit shuffle lifecycle in separate log file
- OFFER_SLOTS
- EXPIRE
- REVIVE
- UNREGISTER

### Why are the changes needed?
 Remove redundant logs of expired shuffle in master-worker heartbeat, see https://github.com/apache/celeborn/pull/3244

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
(base) ➜  celeborn git:(shuffle_audit) grep ShuffleAuditLogger tests/spark-it/target/unit-tests.log
25/05/19 20:05:27,031 INFO [celeborn-dispatcher-41] ShuffleAuditLogger: shuffleKey=local-1747710326897-0        op=OFFER_SLOTS  numReducers=4   workerNum=5     extraSlots=1
25/05/19 20:05:27,719 INFO [celeborn-dispatcher-44] ShuffleAuditLogger: shuffleKey=local-1747710326897-1        op=OFFER_SLOTS  numReducers=2   workerNum=5     extraSlots=3
25/05/19 20:05:28,094 INFO [celeborn-dispatcher-47] ShuffleAuditLogger: shuffleKey=local-1747710326897-2        op=OFFER_SLOTS  numReducers=2   workerNum=5     extraSlots=3
25/05/19 20:05:28,467 INFO [celeborn-dispatcher-52] ShuffleAuditLogger: shuffleKey=local-1747710326897-3        op=OFFER_SLOTS  numReducers=8   workerNum=5     extraSlots=0
25/05/19 20:05:28,769 INFO [celeborn-dispatcher-53] ShuffleAuditLogger: shuffleKey=local-1747710326897-4        op=OFFER_SLOTS  numReducers=8   workerNum=5     extraSlots=0
25/05/19 20:05:29,720 INFO [celeborn-dispatcher-56] ShuffleAuditLogger: shuffleKey=local-1747710326897-5        op=OFFER_SLOTS  numReducers=200 workerNum=5     extraSlots=0
25/05/19 20:05:30,349 INFO [celeborn-dispatcher-59] ShuffleAuditLogger: shuffleKey=local-1747710326897-6        op=OFFER_SLOTS  numReducers=4   workerNum=5     extraSlots=1
25/05/19 20:05:40,534 INFO [celeborn-dispatcher-11] ShuffleAuditLogger: shuffleKey=local-1747710340484-0        op=OFFER_SLOTS  numReducers=4   workerNum=5     extraSlots=1
25/05/19 20:05:41,101 INFO [celeborn-dispatcher-14] ShuffleAuditLogger: shuffleKey=local-1747710340484-1        op=OFFER_SLOTS  numReducers=2   workerNum=5     extraSlots=3
25/05/19 20:05:41,480 INFO [celeborn-dispatcher-17] ShuffleAuditLogger: shuffleKey=local-1747710340484-2        op=OFFER_SLOTS  numReducers=2   workerNum=5     extraSlots=3
25/05/19 20:05:41,848 INFO [celeborn-dispatcher-26] ShuffleAuditLogger: shuffleKey=local-1747710340484-3        op=OFFER_SLOTS  numReducers=8   workerNum=5     extraSlots=0
25/05/19 20:05:42,136 INFO [celeborn-dispatcher-18] ShuffleAuditLogger: shuffleKey=local-1747710340484-4        op=OFFER_SLOTS  numReducers=8   workerNum=5     extraSlots=0
25/05/19 20:05:43,058 INFO [celeborn-dispatcher-21] ShuffleAuditLogger: shuffleKey=local-1747710340484-5        op=OFFER_SLOTS  numReducers=200 workerNum=5     extraSlots=0
25/05/19 20:05:43,542 INFO [celeborn-dispatcher-31] ShuffleAuditLogger: shuffleKey=local-1747710340484-6        op=OFFER_SLOTS  numReducers=4   workerNum=5     extraSlots=1
25/05/19 20:05:44,436 INFO [celeborn-dispatcher-29] ShuffleAuditLogger: shuffleKeys=local-1747710326897-0,local-1747710326897-1,local-1747710326897-2,local-1747710326897-3,local-1747710326897-4,local-1747710326897-5 op=EXPIRE       worker=127.0.0.1:59932:59934:59948:59941
25/05/19 20:05:44,436 INFO [celeborn-dispatcher-27] ShuffleAuditLogger: shuffleKeys=local-1747710326897-0,local-1747710326897-3,local-1747710326897-4,local-1747710326897-5,local-1747710326897-6       op=EXPIRE       worker=127.0.0.1:59930:59938:59944:59940
25/05/19 20:05:44,436 INFO [celeborn-dispatcher-32] ShuffleAuditLogger: shuffleKeys=local-1747710326897-1,local-1747710326897-2,local-1747710326897-3,local-1747710326897-4,local-1747710326897-5,local-1747710326897-6 op=EXPIRE       worker=127.0.0.1:59931:59936:59945:59939
25/05/19 20:05:44,436 INFO [celeborn-dispatcher-33] ShuffleAuditLogger: shuffleKeys=local-1747710326897-0,local-1747710326897-3,local-1747710326897-4,local-1747710326897-5,local-1747710326897-6       op=EXPIRE       worker=127.0.0.1:59933:59935:59946:59943
25/05/19 20:05:44,436 INFO [celeborn-dispatcher-28] ShuffleAuditLogger: shuffleKeys=local-1747710326897-0,local-1747710326897-3,local-1747710326897-4,local-1747710326897-5,local-1747710326897-6       op=EXPIRE       worker=127.0.0.1:59929:59937:59947:59942

```

Closes #3265 from turboFei/shuffle_audit.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-20 05:45:34 -07:00
SteNicholas
d9984c9e0e [CELEBORN-1800] Introduce ApplicationTotalCount and ApplicationFallbackCount metric to record the total and fallback count of application
### What changes were proposed in this pull request?

Introduce `ApplicationTotalCount` and `ApplicationFallbackCount` metric to record the total and fallback count of application.

### Why are the changes needed?

There is no metric that records the total number of applications running with Celeborn shuffle and the engine's built-in shuffle, or the number of applications that fall back. Meanwhile, the fallback of Flink shuffle is based on job granularity rather than shuffle granularity.

Follow up https://github.com/apache/celeborn/pull/3012#issuecomment-2553488532.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `DefaultMetaSystemSuiteJ#testShuffleAndApplicationCountWithFallback`
- `RatisMasterStatusSystemSuiteJ#testShuffleAndApplicationCountWithFallback`

Closes #3026 from SteNicholas/CELEBORN-1800.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-19 07:20:00 -07:00
SteNicholas
8e66ac833a [CELEBORN-1994] Introduce disruptor dependency to support asynchronous logging of log4j2
### What changes were proposed in this pull request?

Introduce disruptor dependency to support asynchronous logging of log4j2.

### Why are the changes needed?

We add `-Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector` to `CELEBORN_MASTER_JAVA_OPTS` and `CELEBORN_WORKER_JAVA_OPTS` in production environments. `AsyncLoggerContextSelector` depends on the disruptor library. Therefore, it's recommended to introduce the disruptor dependency to support log4j2 asynchronous loggers.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Cluster test.

Closes #3246 from SteNicholas/CELEBORN-1994.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-13 19:45:51 +08:00
Aravind Patnam
714722b5d3 [CELEBORN-1982] Slot Selection Perf Improvements
### What changes were proposed in this pull request?
After profiling to see where the hotspots are for slot selection, we identified 2 main areas:
- iter.remove ([link](https://github.com/apache/celeborn/blob/main/master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java#L447)) is a major hotspot, especially when partitionIdList is massive: it is an ArrayList and we remove from the beginning, so each deletion costs O(n).
- `haveDisk` is computed per partitionId, iterated across all workers.  We precompute this and store it as a field in `WorkerInfo`.
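
To illustrate the first hotspot, the sketch below contrasts consuming ids from the head of an `ArrayList` through `Iterator.remove` with consuming them from the tail. It is only an illustration of the cost difference; the class and method names are hypothetical, and the PR may well use a different replacement than tail removal.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class FrontRemovalCost {
    /** Takes up to count ids from the head; every iter.remove() shifts all remaining elements. */
    static List<Integer> takeWithIteratorRemove(List<Integer> ids, int count) {
        List<Integer> taken = new ArrayList<>();
        Iterator<Integer> iter = ids.iterator();
        while (iter.hasNext() && taken.size() < count) {
            taken.add(iter.next());
            iter.remove(); // O(n) array copy when removing near the head of an ArrayList
        }
        return taken;
    }

    /** Takes up to count ids from the tail; removing the last element needs no shifting. */
    static List<Integer> takeFromTail(List<Integer> ids, int count) {
        List<Integer> taken = new ArrayList<>();
        while (!ids.isEmpty() && taken.size() < count) {
            taken.add(ids.remove(ids.size() - 1)); // remove(int index), O(1) at the end
        }
        return taken;
    }

    public static void main(String[] args) {
        List<Integer> head = new ArrayList<>();
        List<Integer> tail = new ArrayList<>();
        for (int i = 0; i < 200_000; i++) {
            head.add(i);
            tail.add(i);
        }
        long t0 = System.nanoTime();
        takeWithIteratorRemove(head, 100_000);
        long t1 = System.nanoTime();
        takeFromTail(tail, 100_000);
        long t2 = System.nanoTime();
        System.out.println("head removal ms: " + (t1 - t0) / 1_000_000);
        System.out.println("tail removal ms: " + (t2 - t1) / 1_000_000);
    }
}
```

Both methods hand out the same number of ids; they differ in which ids are taken first and in how much copying each removal triggers.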

See the below flamegraph for the hotspot of `iter.remove` (`oop_disjoint_arraycopy`) after running a benchmark.

![Screenshot 2025-04-24 at 12 58 34 AM](https://github.com/user-attachments/assets/30bb38f7-9a92-4b52-8480-5e7f26b0d48b)

Below is what we actually observed in production which matches with the above observation from the benchmark:
![realprodflamegraph](https://github.com/user-attachments/assets/d06e095c-2d6d-4892-982a-1c2e828eb71e)

### Why are the changes needed?
Speed up slot selection when there is a large number of partitionIds.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
After applying the above changes, we can see the hotspot is removed in the flamegraph:
![Screenshot 2025-04-24 at 12 53 24 AM](https://github.com/user-attachments/assets/99372140-5746-4a34-9918-642c81fb52e8)

Benchmarks:
Without changes:
```
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection

# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration   1: 2060198.745 ±(99.9%) 306976.270 us/op
# Warmup Iteration   2: 1137534.950 ±(99.9%) 72065.776 us/op
# Warmup Iteration   3: 1032434.221 ±(99.9%) 59585.256 us/op
# Warmup Iteration   4: 903621.382 ±(99.9%) 41542.172 us/op
# Warmup Iteration   5: 921816.398 ±(99.9%) 44025.884 us/op
Iteration   1: 853276.360 ±(99.9%) 13285.688 us/op
Iteration   2: 865183.111 ±(99.9%) 9691.856 us/op
Iteration   3: 909971.254 ±(99.9%) 10201.037 us/op
Iteration   4: 874154.240 ±(99.9%) 11287.538 us/op
Iteration   5: 907655.363 ±(99.9%) 11893.789 us/op

Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
  882048.066 ±(99.9%) 98360.936 us/op [Average]
  (min, avg, max) = (853276.360, 882048.066, 909971.254), stdev = 25544.023
  CI (99.9%): [783687.130, 980409.001] (assumes normal distribution)

# Run complete. Total time: 00:05:43

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                       Mode  Cnt       Score       Error  Units
SlotsAllocatorBenchmark.benchmarkSlotSelection  avgt    5  882048.066 ± 98360.936  us/op

Process finished with exit code 0
```
With changes:
```
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection

# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration   1: 305437.719 ±(99.9%) 81860.733 us/op
# Warmup Iteration   2: 137498.811 ±(99.9%) 7669.102 us/op
# Warmup Iteration   3: 129355.869 ±(99.9%) 5030.972 us/op
# Warmup Iteration   4: 135311.734 ±(99.9%) 6964.080 us/op
# Warmup Iteration   5: 131013.323 ±(99.9%) 8560.232 us/op
Iteration   1: 133695.396 ±(99.9%) 3713.684 us/op
Iteration   2: 143735.961 ±(99.9%) 5858.078 us/op
Iteration   3: 135619.704 ±(99.9%) 5257.352 us/op
Iteration   4: 128806.160 ±(99.9%) 4541.790 us/op
Iteration   5: 134179.546 ±(99.9%) 5137.425 us/op

Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
  135207.354 ±(99.9%) 20845.544 us/op [Average]
  (min, avg, max) = (128806.160, 135207.354, 143735.961), stdev = 5413.522
  CI (99.9%): [114361.809, 156052.898] (assumes normal distribution)

# Run complete. Total time: 00:05:29

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                       Mode  Cnt       Score       Error  Units
SlotsAllocatorBenchmark.benchmarkSlotSelection  avgt    5  135207.354 ± 20845.544  us/op

Process finished with exit code 0
```

882048.066 us/op without changes vs 135207.354 us/op with changes. That is about a 6.5x improvement.

Closes #3228 from akpatnam25/CELEBORN-1982.

Lead-authored-by: Aravind Patnam <akpatnam25@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-04-27 11:13:21 +08:00
zhaohehuhu
95f0acfbd1 [CELEBORN-1961] Convert Resource.proto from Protocol Buffers version 2 to version 3
### What changes were proposed in this pull request?

as title

### Why are the changes needed?
Upgrade the PB version as the first step, as per the design below

https://cwiki.apache.org/confluence/display/CELEBORN/CIP-16+Merge+transport+proto+and+resource+proto+files

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #3201 from zhaohehuhu/dev-0403.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-04-21 19:50:45 +08:00
Wang, Fei
f1b71e3eb7 [CELEBORN-1436][FOLLOWUP] Add swagger editor links for RESTful spec
### What changes were proposed in this pull request?

Add swagger editor links for RESTful spec.

Fix warnings in the spec:
Master spec:
![image](https://github.com/user-attachments/assets/ff71aedf-c68d-472a-b0f8-e526d87d45ed)

Worker spec:
![image](https://github.com/user-attachments/assets/6820a25e-679f-4790-a3c2-d2757b34b0e4)

### Why are the changes needed?

To view the spec online.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
<img width="1103" alt="image" src="https://github.com/user-attachments/assets/0118e47d-da2d-43c8-a41d-085cde2ed06f" />

No warnings now, see:
https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/turbofei/incubator-celeborn/openapi/openapi/openapi-client/src/main/openapi3/master_rest_v1.yaml

https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/turbofei/incubator-celeborn/openapi/openapi/openapi-client/src/main/openapi3/worker_rest_v1.yaml

Closes #3200 from turboFei/openapi.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-04-21 15:28:52 +08:00
veli.yang
7d0ba7f9b8 [CELEBORN-1916] Support Aliyun OSS Based on MPU Extension Interface
### What changes were proposed in this pull request?

- close [CELEBORN-1916](https://issues.apache.org/jira/browse/CELEBORN-1916)
- This PR extends the Multipart Uploader (MPU) interface to support Aliyun OSS.

### Why are the changes needed?

- Implemented multipart-uploader-oss module based on the existing MPU extension interface.
- Added necessary configurations and dependencies for Aliyun OSS integration.
- Ensured compatibility with the existing multipart-uploader framework.
- This enhancement allows seamless multipart upload functionality for Aliyun OSS, similar to the existing AWS S3 support.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Deployment integration testing has been completed in the local environment.

Closes #3157 from shouwangyw/optimize/mpu-oss.

Lead-authored-by: veli.yang <897900564@qq.com>
Co-authored-by: yangwei <897900564@qq.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-04-08 15:10:33 +08:00
Wang, Fei
1e30f159b9 [CELEBORN-1577][FOLLOWUP] Add UpdateResourceConsumptionTime timer and prevent NPE if metrics not found
### What changes were proposed in this pull request?
Follow up for https://github.com/apache/celeborn/pull/2819
1. add timer for UpdateResourceConsumptionTime
2. prevent NPE if metrics not found

### Why are the changes needed?

The timer was not added, which causes an NPE.
```
25/03/31 13:18:48,219 WARN [master-quota-checker] MasterSource: Metric UpdateResourceConsumptionTime{instance="zeus-slc-cm-1.zeus-slc-cm-cm.hm-dev.svc.140.tess.io:8080",role="master"} not found!
25/03/31 13:18:48,220 WARN [master-quota-checker] MasterSource: Exception encountered during stop timer of metric UpdateResourceConsumptionTime{instance="zeus-slc-cm-1.zeus-slc-cm-cm.hm-dev.svc.140.tess.io:8080",role="master"}
scala.MatchError: null
	at org.apache.celeborn.common.metrics.source.AbstractSource.doStopTimer(AbstractSource.scala:316)
	at org.apache.celeborn.common.metrics.source.AbstractSource.sample(AbstractSource.scala:279)
	at org.apache.celeborn.service.deploy.master.quota.QuotaManager.updateResourceConsumption(QuotaManager.scala:201)
	at org.apache.celeborn.service.deploy.master.quota.QuotaManager$$anon$1.run(QuotaManager.scala:59)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Existing UT.

Closes #3190 from turboFei/fix_npe.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-04-01 19:24:23 +08:00
Xianming Lei
0a97ca0aa9 [CELEBORN-1577][PHASE2] QuotaManager should support interrupt shuffle
### What changes were proposed in this pull request?
1. Worker reports resourceConsumption to master
2. QuotaManager calculates the resourceConsumption of each app and marks the apps that exceed the quota.
    2.1 When the tenant's resourceConsumption exceeds the tenant's quota, select the app with a larger consumption to mark interrupted.
    2.2 When the resourceConsumption of the cluster exceeds the cluster quota, select the app with larger consumption to mark interrupted.
3. Master tells the Driver through the heartbeat whether the app is marked interrupted.
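
The selection in step 2 can be sketched roughly as below. This is only an illustration, not the QuotaManager code: the class `QuotaSketch`, the method `appsToInterrupt`, and the stopping rule (mark the largest consumers until the remainder fits the quota) are assumptions. The description above only states that apps with larger consumption are chosen.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and the exact stopping rule are assumptions.
final class QuotaSketch {
    // Returns the app ids to mark interrupted, largest consumers first,
    // until the remaining total consumption fits within the quota.
    static List<String> appsToInterrupt(Map<String, Long> bytesByApp, long quotaBytes) {
        long total = 0L;
        for (long bytes : bytesByApp.values()) {
            total += bytes;
        }
        List<Map.Entry<String, Long>> apps = new ArrayList<>(bytesByApp.entrySet());
        apps.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()));
        List<String> interrupted = new ArrayList<>();
        for (Map.Entry<String, Long> app : apps) {
            if (total <= quotaBytes) {
                break; // the remaining apps fit within the quota
            }
            interrupted.add(app.getKey());
            total -= app.getValue();
        }
        return interrupted;
    }
}
```

The same routine could be applied once per tenant and once for the whole cluster, matching cases 2.1 and 2.2.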

### Why are the changes needed?
The current storage quota logic can only limit new shuffles; it cannot limit writes to existing shuffles. Our production environment has the following scenario: the cluster is small, but a single shuffle of a user's app is large and occupies a lot of disk resources, and we want to interrupt such shuffles.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UTs.

Closes #2819 from leixm/CELEBORN-1577-2.

Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-03-24 22:05:45 +08:00
Angerszhuuuu
151fd35676 [CELEBORN-1923] Correct Celeborn available slots calculation logic
### What changes were proposed in this pull request?
Fix the incorrect logic for calculating a disk's available slots.

### Why are the changes needed?
Currently we compute `maxSlots = usableSize / estimatedPartitionSize`,
then `availableSlots = maxSlots - allocatedSlots`.
But `availableSlots` should be `usableSize / estimatedPartitionSize`.
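
A worked example of the two formulas, as a minimal sketch. The class and method names are hypothetical. The reading that `usableSize` is the space still free, so that subtracting `allocatedSlots` counts the used space twice, is an assumption and not something the description states.

```java
// Hypothetical sketch: names are illustrative, not Celeborn's actual fields.
final class SlotSketch {
    // Previous logic as described above: derive a maximum from the usable
    // size, then subtract the slots that are already allocated.
    static long availableSlotsOld(long usableSize, long estimatedPartitionSize, long allocatedSlots) {
        long maxSlots = usableSize / estimatedPartitionSize;
        return maxSlots - allocatedSlots;
    }

    // Corrected logic as described above: the usable size alone determines
    // how many more partitions fit.
    static long availableSlotsNew(long usableSize, long estimatedPartitionSize) {
        return usableSize / estimatedPartitionSize;
    }
}
```

With `usableSize = 640`, `estimatedPartitionSize = 64` and `allocatedSlots = 4` (same unit for both sizes), the old formula yields 6 available slots and the corrected one yields 10.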

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
MT

Closes #3162 from AngersZhuuuu/CELEBORN-1923.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-03-23 15:27:06 -07:00
SteNicholas
38f3bdd375 [CELEBORN-1909] Support pre-run static code blocks of TransportMessages to improve performance of protobuf serialization
### What changes were proposed in this pull request?

Support pre-run static code blocks of `TransportMessages` to improve performance of protobuf serialization.

### Why are the changes needed?

The protobuf message protocol defines many map type fields, which makes it time-consuming to build these message instances. This is because `TransportMessages` contains static code blocks to initialize a large number of `Descriptor`s and `FieldAccessorTable`s, where the instantiation of `FieldAccessorTable` includes reflection. The test result proves that the static code blocks execute in about 70 milliseconds.

Therefore, it's better to pre-run static code blocks of `TransportMessages` to improve performance of protobuf serialization. Meanwhile, it's recommended to use repeated instead of map type field for rpc messages.
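
Pre-running a static initializer can be sketched with `Class.forName`, as below. This is an assumption about one way to do it, not necessarily what the PR does; `Warmup`, `HeavyMessages` and `INIT_COUNT` are hypothetical names, and `HeavyMessages` merely stands in for a generated class such as `TransportMessages`.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: Warmup and HeavyMessages are illustrative names only.
final class Warmup {
    // Counts how many times the expensive static block has run.
    static final AtomicInteger INIT_COUNT = new AtomicInteger();

    // Stand-in for a generated class with a costly static initializer.
    static final class HeavyMessages {
        static {
            // Generated protobuf code would build descriptors and accessor tables here.
            INIT_COUNT.incrementAndGet();
        }
    }

    // Runs the static initializer of the named class now instead of on first use.
    static void preload(String className) {
        try {
            // The second argument (true) asks the JVM to initialize the class.
            Class.forName(className, true, Warmup.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("cannot preload " + className, e);
        }
    }
}
```

Calling `Warmup.preload(Warmup.HeavyMessages.class.getName())` during startup pays the initialization cost once, before the first message is built; later calls are no-ops because a class is initialized only once.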

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3149 from SteNicholas/CELEBORN-1909.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-03-18 11:34:39 +08:00
Wang, Fei
fc056a3c3a [CELEBORN-1875] Support to get workers topology information with RESTful api
### What changes were proposed in this pull request?

Support to get the workers topology information with RESTful api.
1. return networkLocation in WorkerData
2. add new api `/api/v1/workers/topology` to return the grouped workers topology info.
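
The grouping behind the topology endpoint can be pictured as below. This is a sketch under assumptions: `TopologySketch`, its `Worker` class and the host-only output are hypothetical, and the real response schema is not shown in this description.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: the Worker type and output shape are assumptions.
final class TopologySketch {
    // Minimal worker record: only the fields needed for grouping.
    static final class Worker {
        final String host;
        final String networkLocation;

        Worker(String host, String networkLocation) {
            this.host = host;
            this.networkLocation = networkLocation;
        }
    }

    // Groups worker hosts by network location, with locations sorted by name.
    static Map<String, List<String>> groupByNetworkLocation(List<Worker> workers) {
        return workers.stream().collect(Collectors.groupingBy(
            w -> w.networkLocation,
            TreeMap::new,
            Collectors.mapping(w -> w.host, Collectors.toList())));
    }
}
```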

### Why are the changes needed?

1. To get the workers topology information.
2. To better understand the rack awareness.

### Does this PR introduce _any_ user-facing change?

No breaking change.

### How was this patch tested?
UT and IT.

<img width="1008" alt="image" src="https://github.com/user-attachments/assets/6cb1aa2a-1160-4570-acb1-7602e2ce0b09" />

<img width="719" alt="image" src="https://github.com/user-attachments/assets/d26c3326-4837-40ad-a344-3cb4204bf607" />

Closes #3112 from turboFei/topology.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-02-26 16:27:32 +08:00
gaoyajun02
c45197c0c1 [CELEBORN-1843] Optimize roundrobin for more balanced disk slot allocation
### What changes were proposed in this pull request?

This PR optimizes the RoundRobin algorithm to achieve a more balanced disk slot allocation across workers.
Previously, when allocating 3000 partitions using RoundRobin, the slot distribution across worker disks was [668, 666, 666], which resulted in one disk having 2 more slots than the others.
After the optimization, the slot distribution is now [667, 667, 666], ensuring a more balanced allocation.
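
The balanced result can be reproduced with a plain round-robin hand-out, sketched below. This is not Celeborn's allocator; `RoundRobinSketch` and `distribute` are hypothetical names. The example uses 2000 slots, which is the total of the distribution quoted above.

```java
// Hypothetical sketch: a plain round-robin counter, not the real allocator.
final class RoundRobinSketch {
    // Hands out `slots` one at a time, cycling over `diskCount` disks, so that
    // no disk ends up with more than one slot above any other.
    static int[] distribute(int slots, int diskCount) {
        int[] perDisk = new int[diskCount];
        for (int i = 0; i < slots; i++) {
            perDisk[i % diskCount]++;
        }
        return perDisk;
    }
}
```

`RoundRobinSketch.distribute(2000, 3)` returns `[667, 667, 666]`, the balanced distribution described above.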

### Why are the changes needed?

The changes are necessary to improve load balancing across worker disks, reducing the risk of overloading a single disk. This ensures a more predictable and fair distribution of slots, which can lead to better performance and resource utilization.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #3074 from gaoyajun02/1843.

Authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-02-12 14:30:08 +08:00
Wang, Fei
f28ba6e728 [CELEBORN-1810] Using Operation description instead of ApiResponse description for RESTful APIs
### What changes were proposed in this pull request?
Using Operation description instead of ApiResponse description for RESTful APIs.

### Why are the changes needed?
Put the API description in the correct place.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Before:
<img width="1434" alt="image" src="https://github.com/user-attachments/assets/7a92c4a7-d550-4221-bee2-d52918719521" />

After:
<img width="1433" alt="image" src="https://github.com/user-attachments/assets/0287b425-7b56-4ef7-ba5d-4b53b4208780" />

Closes #3038 from turboFei/api_desc.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-23 09:42:03 +08:00
zhengtao
ac0d335f40 [CELEBORN-1831] Add ratis commitIndex metrics
### What changes were proposed in this pull request?
Add two metrics (raft commitIndex of each master and maxCommitIndex - minCommitIndex value).

### Why are the changes needed?
To observe the metadata synchronization of the raft cluster.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Cluster test.
![image](https://github.com/user-attachments/assets/f354a3cd-e3b3-4af0-98c2-fc13330b2d81)

Closes #3063 from zaynt4606/clb1831.

Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-17 10:58:06 +08:00
Wang, Fei
b7e3eaa46d [CELEBORN-1477][FOLLOWUP] Minor fix for v1 RESTful apis before release
### What changes were proposed in this pull request?

Minor fixes to the v1 RESTful APIs before the 0.6.0 release.

1. Update the API description to use upper-case worker EventType.
2. Rename `subResourceConsumption` to `subResourceConsumptions`.

### Why are the changes needed?
1. With https://github.com/apache/celeborn/pull/2754, the openapi-sdk works well. However, for RESTful calls made without the SDK, the worker eventType is still case-sensitive, which might be caused by the Jersey issue mentioned in https://github.com/eclipse-ee4j/jersey/issues/5288. So this PR changes the description in Swagger to guide users.
<img width="1524" alt="image" src="https://github.com/user-attachments/assets/70e4f239-dc36-47bc-902e-5340986f014a" />

2. Rename `subResourceConsumption` to `subResourceConsumptions`.

### Does this PR introduce _any_ user-facing change?
No, the api has not been released.

### How was this patch tested?
GA.

Closes #3023 from turboFei/restful_minor_fix.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-02 23:00:15 +08:00
mingji
fde6365f68 [CELEBORN-1413] Support Spark 4.0
### What changes were proposed in this pull request?
To support Spark 4.0.0 preview.

### Why are the changes needed?
1. Changed Scala to 2.13.
2. Introduce columnar shuffle module for spark 4.0.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Cluster test.

Closes #2813 from FMX/b1413.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-24 18:12:27 +08:00
Wang, Fei
03656b5b1c [CELEBORN-1634][FOLLOWUP] Add rpc metrics into grafana dashboard
### What changes were proposed in this pull request?

1. Rename the RPC metrics from `${name}_${metric}` to `Rpc${metric}{name=$name}` so that they are easy to add to the grafana dashboard.
2. Use MASTER/WORKER/CLIENT Role for rpc env.
3. add the rpc metrics into grafana dashboard.
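
The effect of the rename in item 1 can be sketched as below. The helper names and the sample metric `QueueTime` are hypothetical, and the quoted label value follows the usual Prometheus label syntax, which is an assumption here. A single metric name with a `name` label lets one dashboard query cover every RPC env.

```java
// Hypothetical sketch: method names and the sample metric are illustrative.
final class RpcMetricNameSketch {
    // Old scheme: the RPC env name is baked into the metric name.
    static String oldName(String name, String metric) {
        return name + "_" + metric;
    }

    // New scheme: a fixed "Rpc"-prefixed metric name with the env name as a label.
    static String newName(String name, String metric) {
        return "Rpc" + metric + "{name=\"" + name + "\"}";
    }
}
```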

### Why are the changes needed?

For monitoring

### Does this PR introduce _any_ user-facing change?
No, it has not been released

### How was this patch tested?
UT for metrics source `instance`.

<img width="1456" alt="image" src="https://github.com/user-attachments/assets/90284390-54ad-49ef-a868-fa537d2301b8">

<img width="1880" alt="image" src="https://github.com/user-attachments/assets/e8101e47-d649-4c66-9978-1efb4faa047f">

Closes #2990 from turboFei/rpc_metrics.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-24 11:13:49 +08:00
Wang, Fei
680b072b5b [CELEBORN-1753] Optimize the code for exists and find method
### What changes were proposed in this pull request?

Optimize the code for `exists` and `find`.

1. Improve performance by looking up workerInfo by workerUniqueId instead of looping over the collection:
 74c1ec0a7f/client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala (L65-L66)

Change the type to:
```
 type ShuffleAllocatedWorkers =
    ConcurrentHashMap[Int, ConcurrentHashMap[String, ShufflePartitionLocationInfo]]
```
And save the `WorkerInfo` into `ShufflePartitionLocationInfo`.
```
class ShufflePartitionLocationInfo(val workerInfo: WorkerInfo) {
...
}
```

This way, the `WorkerInfo` can be looked up quickly by worker uniqueId.

2. Reduce the loop cost for the code below: 33ba0e02f5/worker/src/main/scala/org/apache/celeborn/service/deploy/worker/Controller.scala (L455-L466)
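
The difference between scanning and keyed lookup in item 1 can be sketched as below. This is a simplification under assumptions: `LookupSketch` and its `WorkerInfo` are hypothetical, and the inner map stores the worker directly instead of a `ShufflePartitionLocationInfo` that holds it.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: simplified types, not the LifecycleManager code.
final class LookupSketch {
    static final class WorkerInfo {
        final String uniqueId;

        WorkerInfo(String uniqueId) {
            this.uniqueId = uniqueId;
        }
    }

    // Before: scan every worker until the id matches, O(n) per lookup.
    static WorkerInfo findByScan(List<WorkerInfo> workers, String uniqueId) {
        for (WorkerInfo worker : workers) {
            if (worker.uniqueId.equals(uniqueId)) {
                return worker;
            }
        }
        return null;
    }

    // After: shuffleId -> (worker uniqueId -> worker), one hash lookup per level.
    static WorkerInfo findByKey(
            ConcurrentHashMap<Integer, ConcurrentHashMap<String, WorkerInfo>> allocated,
            int shuffleId,
            String uniqueId) {
        ConcurrentHashMap<String, WorkerInfo> byWorker = allocated.get(shuffleId);
        return byWorker == null ? null : byWorker.get(uniqueId);
    }
}
```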

### Why are the changes needed?

Enhance the performance.
Address comments:
https://github.com/apache/celeborn/pull/2959#pullrequestreview-2466200199
https://github.com/apache/celeborn/pull/2959#issuecomment-2505137166

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

GA

Closes #2962 from turboFei/CELEBORN_1753_exists.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-23 17:56:20 +08:00
Wang, Fei
2efdf755cc [CELEBORN-1711][TEST] Fix flaky test caused by master/worker setup issue
### What changes were proposed in this pull request?

1. Retry on BindException when starting the master/worker http server.
2. Record the used ports and pre-check whether the selected port is used or bound before binding.
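
The two steps above can be sketched as a retry loop, shown below. It is only an illustration of the idea, not the test-suite code: `BindRetrySketch`, `PortBinder` and `bindWithRetry` are hypothetical, the binder is injected so the sketch does not open real sockets, and the pre-check against ports bound by other processes is left out.

```java
import java.net.BindException;
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntSupplier;

// Hypothetical sketch: names and structure are assumptions.
final class BindRetrySketch {
    // Callback that binds a server to a port or throws BindException.
    interface PortBinder {
        void bind(int port) throws BindException;
    }

    // Ports already handed out in this JVM.
    private static final Set<Integer> USED_PORTS = new HashSet<>();

    // Picks candidate ports until one binds, skipping ports that were already recorded.
    static synchronized int bindWithRetry(IntSupplier portPicker, PortBinder binder, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int port = portPicker.getAsInt();
            if (!USED_PORTS.add(port)) {
                continue; // already taken by an earlier caller in this JVM
            }
            try {
                binder.bind(port);
                return port;
            } catch (BindException e) {
                // The port is held elsewhere; try the next candidate.
            }
        }
        throw new IllegalStateException("no free port after " + maxAttempts + " attempts");
    }
}
```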

### Why are the changes needed?

To fix flaky test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
GA.

Closes #2906 from turboFei/retry_master_suite.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-12-17 10:45:40 +08:00
Wang, Fei
c893287bea [CELEBORN-1756] Only gauge hdfs metrics if HDFS storage enabled to reduce metrics
### What changes were proposed in this pull request?

If `HDFS` is not defined in the `celeborn.storage.availableTypes`, do not gauge the HDFS metrics.
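
The check can be pictured as below, as a minimal sketch. `StorageTypeSketch` and `hdfsEnabled` are hypothetical names, and treating the config value as a comma-separated list with exact-case matching is an assumption.

```java
import java.util.Arrays;

// Hypothetical sketch: the parsing rules are assumptions.
final class StorageTypeSketch {
    // Returns true if HDFS appears in a comma-separated list of storage types.
    static boolean hdfsEnabled(String availableTypes) {
        return Arrays.stream(availableTypes.split(","))
            .map(String::trim)
            .anyMatch("HDFS"::equals);
    }
}
```

A metrics source would then register its HDFS gauges only when `hdfsEnabled(...)` returns true for the configured value.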

### Why are the changes needed?

To reduce the number of metrics, because there is a limit on metrics capacity.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #2965 from turboFei/user_metrics.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-12-02 14:11:56 +08:00
zhaohehuhu
3bf91929b6 [CELEBORN-1746] Reduce the size of aws dependencies
### What changes were proposed in this pull request?
Because the AWS cloud vendor's client JARs are large, this PR keeps only the AWS S3 module, reducing the AWS dependency size from over 296MB to around 2.3MB.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

<img width="2560" alt="Screenshot 2024-11-25 at 16 17 52" src="https://github.com/user-attachments/assets/efebbe7d-73cb-47fb-b7fa-9aae052f744b">
Tested in the lab, as shown in the picture above.

Closes #2944 from zhaohehuhu/dev-1125.

Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-28 19:45:01 +08:00
Sanskar Modi
259dfcd988 [CELEBORN-1621][FOLLOWUP] Support enabling worker tags via config
### What changes were proposed in this pull request?

- Add support for enabling/disabling the worker tags feature via a master config flag.
- Fixed a bug: after #2936, admins can also define the tagsExpr for users. If a user passes an empty tagsExpr, the current code ignores the admin-defined tagsExpr and allows the job to use all workers.

### Why are the changes needed?

https://cwiki.apache.org/confluence/display/CELEBORN/CIP-11+Supporting+Tags+in+Celeborn

### Does this PR introduce _any_ user-facing change?

NA

### How was this patch tested?
Existing UTs

Closes #2953 from s0nskar/tags-enabled.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-28 11:22:35 +08:00
Wang, Fei
59163c2a23 [CELEBORN-1745] Remove application top disk usage code
### What changes were proposed in this pull request?
Remove the code for app top disk usage both in master and worker end.

Prefer the Prometheus expression below to find the apps with the highest disk usage.
```
topk(50, sum by (applicationId) (metrics_diskBytesWritten_Value{role="worker", applicationId!=""}))
```

### Why are the changes needed?
To address comments: https://github.com/apache/celeborn/pull/2947#issuecomment-2499564978

> Due to the application dimension resource consumption, this feature should be included in the deprecated features. Maybe you can remove the codes for application top disk usage.

### Does this PR introduce _any_ user-facing change?

Yes, remove the app top disk usage api.

### How was this patch tested?
GA.

Closes #2949 from turboFei/remove_app_top_usage.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-28 10:55:34 +08:00
Sanskar Modi
712d9a496e [CELEBORN-1621][CIP-11] Predefined worker tags expr via dynamic configs
### What changes were proposed in this pull request?

Support predefined tags expressions for tenants and users via dynamic config. With this, admins can configure tags for users/tenants and give special users permission to provide custom tags expressions.

### Why are the changes needed?

https://cwiki.apache.org/confluence/display/CELEBORN/CIP-11+Supporting+Tags+in+Celeborn

### Does this PR introduce _any_ user-facing change?

NA

### How was this patch tested?
UTs

Closes #2936 from s0nskar/admin_tags.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-26 20:40:30 +08:00
zhaohehuhu
a2d3972318 [CELEBORN-1530] support MPU for S3
### What changes were proposed in this pull request?

as title

### Why are the changes needed?
AWS S3 doesn't support append, so Celeborn had to copy the historical data from S3 to the worker and write it to S3 again, which heavily amplifies the writes. This PR implements a better solution via MPU (multipart upload) to avoid copy-and-write.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

![WechatIMG257](https://github.com/user-attachments/assets/968d9162-e690-4767-8bed-e490e3055753)

I conducted an experiment with a 1GB input dataset to compare the performance of Celeborn using only S3 storage versus using SSD storage. The results showed that Celeborn with SSD storage was approximately three times faster than with only S3 storage.

<img width="1728" alt="Screenshot 2024-11-16 at 13 02 10" src="https://github.com/user-attachments/assets/8f879c47-c01a-4004-9eae-1c266c1f3ef2">

The above screenshot is from the second test I ran, with 5000 mappers and reducers.

Closes #2830 from zhaohehuhu/dev-1021.

Lead-authored-by: zhaohehuhu <luoyedeyi@163.com>
Co-authored-by: He Zhao <luoyedeyi459@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-22 15:03:53 +08:00
SteNicholas
6fc50e36b9 [CELEBORN-1631][FOLLOWUP] Add latest snapshot index in message of HandleResponse for /snapshot/create
### What changes were proposed in this pull request?

Add latest snapshot index in message of `HandleResponse` for `/snapshot/create`.

### Why are the changes needed?

[TakeSnapshotCommand.java#68](https://github.com/apache/ratis/blob/master/ratis-shell/src/main/java/org/apache/ratis/shell/cli/sh/snapshot/TakeSnapshotCommand.java#L68) prints the latest snapshot index. Therefore, the message of `HandleResponse` for `/snapshot/create` could also provide the latest snapshot index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2908 from SteNicholas/CELEBORN-1631.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-20 11:38:29 +08:00
SteNicholas
32d73c0d75 [CELEBORN-1619][CELEBORN-474][FOLLOWUP] TagsManager uses JavaUtils#newConcurrentHashMap to speed up ConcurrentHashMap#computeIfAbsent
### What changes were proposed in this pull request?

`TagsManager` uses `JavaUtils#newConcurrentHashMap` to speed up `ConcurrentHashMap#computeIfAbsent`.

Follow up #2844.

### Why are the changes needed?

Celeborn supports JDK8, which can hit the bug described in [JDK-8161372](https://bugs.openjdk.org/browse/JDK-8161372). Therefore, it's better to use `JavaUtils#newConcurrentHashMap` to speed up `ConcurrentHashMap#computeIfAbsent`.
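
The usual workaround for JDK-8161372 is to try a plain `get` before `computeIfAbsent`, sketched below. `FastComputeMap` is a hypothetical class for illustration; whether `JavaUtils#newConcurrentHashMap` returns a map built exactly this way is an assumption.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch; not the actual JavaUtils implementation.
final class FastComputeMap<K, V> extends ConcurrentHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    @Override
    public V computeIfAbsent(K key, Function<? super K, ? extends V> mappingFunction) {
        // Lock-free read first: on JDK 8, computeIfAbsent can lock the bin
        // even when the key is already present (JDK-8161372).
        V existing = get(key);
        return existing != null ? existing : super.computeIfAbsent(key, mappingFunction);
    }
}
```

Callers keep using `computeIfAbsent` as before; only lookups of keys that already exist take the cheaper path.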

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #2922 from SteNicholas/CELEBORN-1619.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-18 16:31:47 +08:00
Erik.fang
8fd44b42ba [CELEBORN-1634] Support queue time/processing time metrics for rpc framework
### What changes were proposed in this pull request?

Implement queue time/processing time metrics for the rpc framework.

### Why are the changes needed?

To identify rpc processing bottlenecks.
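
The two metrics can be derived from three timestamps per message, as in the sketch below. `RpcTimingSketch` and its fields are hypothetical; how the rpc framework actually captures and reports these values is not shown in this description.

```java
// Hypothetical sketch: field and method names are assumptions.
final class RpcTimingSketch {
    // Timestamps in nanoseconds, e.g. taken with System.nanoTime().
    final long enqueuedAt;  // message placed into the inbox
    final long dequeuedAt;  // a handler thread picked the message up
    final long finishedAt;  // the handler returned

    RpcTimingSketch(long enqueuedAt, long dequeuedAt, long finishedAt) {
        this.enqueuedAt = enqueuedAt;
        this.dequeuedAt = dequeuedAt;
        this.finishedAt = finishedAt;
    }

    // Time the message spent waiting in the queue.
    long queueTimeNanos() {
        return dequeuedAt - enqueuedAt;
    }

    // Time the handler spent processing the message.
    long processingTimeNanos() {
        return finishedAt - dequeuedAt;
    }
}
```

A high queue time with a low processing time points at too few handler threads or a backlog, while a high processing time points at the handler itself.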

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

local

Closes #2784 from ErikFang/main-rpc-metrics.

Lead-authored-by: Erik.fang <fmerik@gmail.com>
Co-authored-by: 仲甫 <fangming@antgroup.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-18 09:39:56 +08:00
xzh
9e00d726e8 [CELEBORN-1714] Optimize handleApplicationLost
### What changes were proposed in this pull request?
Optimize handleApplicationLost

### Why are the changes needed?
timeoutDeadApplications#ApplicationLost should be handled promptly, rather than being processed in the Master RPC queue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #2910 from xy2953396112/optimize_applost.

Authored-by: xzh <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2024-11-15 19:30:06 +08:00
Wang, Fei
81a0d5113c [CELEBORN-1660] Cache available workers and only count the available workers device free capacity
### What changes were proposed in this pull request?
1. cache the available workers
2. Only count the available workers device free capacity.
3. place the metrics_AvailableWorkerCount_Value in overall and metrics_WorkerCount_Value in `Master` part

### Why are the changes needed?
Cache the available workers to avoid frequently looping over all workers.
To have an accurate device capacity overview that does not include the excluded workers, decommissioning workers, etc.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
UT.

<img width="1705" alt="image" src="https://github.com/user-attachments/assets/bee17b4e-785d-4112-8410-dbb684270ec0">

Closes #2827 from turboFei/device_free.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-14 11:10:45 +08:00
Sanskar Modi
a59ebfe905 [CELEBORN-1619][CIP-11] Integrate tags manager with config service
### What changes were proposed in this pull request?

Integrate TagsManager with ConfigService. Users can now update the worker tags using ConfigService.

### Why are the changes needed?

https://cwiki.apache.org/confluence/display/CELEBORN/CIP-11+Supporting+Tags+in+Celeborn

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
UTs

Closes #2844 from s0nskar/tags_via_config.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-14 09:57:02 +08:00
Wang, Fei
aa625496ee [CELEBORN-1711][TEST] Fix '65535' port is invalid
### What changes were proposed in this pull request?

Fix flaky test caused by invalid port.
```
[info] ApiMasterResourceSuite:
[info] org.apache.celeborn.service.deploy.master.http.api.ApiMasterResourceSuite *** ABORTED ***
[info]   java.lang.IllegalArgumentException: '65535' in celeborn.master.http.port is invalid. Invalid port
```

### Why are the changes needed?

The port range in CelebornConf is [1024, 65535); 65535 is excluded.

169b6f6973/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala (L2315-L2324)
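
The half-open range can be expressed as below. This is a sketch, not the CelebornConf check: `PortRangeSketch` is hypothetical, it only covers the range stated above, and the random port helper is an assumption about how a test might choose ports.

```java
import java.util.Random;

// Hypothetical sketch: covers only the stated range [1024, 65535).
final class PortRangeSketch {
    static final int MIN_PORT = 1024;   // inclusive
    static final int MAX_PORT = 65535;  // exclusive

    static boolean isValid(int port) {
        return port >= MIN_PORT && port < MAX_PORT;
    }

    // Picks a port in [MIN_PORT, MAX_PORT); nextInt(bound) excludes the bound.
    static int randomPort(Random random) {
        return MIN_PORT + random.nextInt(MAX_PORT - MIN_PORT);
    }
}
```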

<img width="928" alt="image" src="https://github.com/user-attachments/assets/4532b1bc-c548-45cd-b836-c493f2904422">

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

GA.

Closes #2901 from turboFei/fix_invalid_port.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2024-11-11 23:39:41 -08:00
Wang, Fei
be4c02e6d0 [CELEBORN-1601][FOLLOWUP] Refine the RESTful apis for revise lost shuffles
### What changes were proposed in this pull request?

1. `GET /api/v1/applications/deleteApps`  -> `DELETE /api/v1/applications`

2. `GET /api/v1/applications/reviseLostShuffles`  -> `POST /api/v1/applications/revise_lost_shuffles`

### Why are the changes needed?
Followup for https://github.com/apache/celeborn/pull/2746

### Does this PR introduce _any_ user-facing change?

No, these APIs have not been released yet.

### How was this patch tested?
GA.

Closes #2892 from turboFei/delete_app.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-12 10:11:25 +08:00
Wang, Fei
330b2a094e [CELEBORN-1708] Bump protobuf version from 3.21.7 to 3.25.5
### What changes were proposed in this pull request?

Bump protobuf from 3.21.7 to 3.25.5.

### Why are the changes needed?

To fix CVE: https://github.com/advisories/GHSA-735f-pc8j-v9w8

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

GA.

Closes #2898 from turboFei/bump_protobuf.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-11 17:02:23 +08:00