### What changes were proposed in this pull request?
Support to register application info with user identifier and extra info.
### Why are the changes needed?
To provide more insight for the application information.
### Does this PR introduce _any_ user-facing change?
A new RESTful API.
### How was this patch tested?
UT.
Closes#3428 from turboFei/app_info_uid.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix NettyRpcEnv,Master,Worker `Source` to avoid thread leak
### Why are the changes needed?
1. NettyRpcEnv should clean rpcSource to prevent resource leak.
2. Master clean resourceConsumptionSource, masterSource, threadPoolSource, jVMSource, jVMCPUSource, systemMiscSource
3. Worker clean clean workerSource.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
local
Closes#3418 from xy2953396112/CELEBORN-2104.
Authored-by: dz <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
This PR enhances the Ratis peer add operation to support clientAddress and adminAddress parameters with RESTful api, allowing these critical RPC endpoints to be properly configured when adding new peers to the Celeborn master cluster.
### Why are the changes needed?
Currently, when expanding the Celeborn master cluster using the ratis peer add operation, newly added peers lack clientAddress and adminAddress settings. If a newly added peer becomes the Leader, all Followers will return empty addresses to clients, causing them to attempt connections to an incorrect Leader address (127.0.0.1:0). This change ensures proper client request routing in expanded clusters.
### Does this PR introduce _any_ user-facing change?
Yes, this PR extends the API for adding Ratis peers by adding support for clientAddress and adminAddress parameters. Users will now be able to specify these addresses when adding new peers to the cluster.
### How was this patch tested?
Manual testing of cluster expansion scenarios to ensure clients can correctly connect to the Leader regardless of which peer holds leadership
```
➜ curl -sX POST zw06-data-k8s-sparktest-node007.mt:9098/api/v1/ratis/peer/add \
-H "Content-Type: application/json" \
-d '{ "peers": [{"id": "2", "address": "zw06-data-k8s-sparktest-node009.mt:9872", "clientAddress": "zw06-data-k8s-sparktest-node009.mt:9097", "adminAddress": "zw06-data-k8s-sparktest-node009.mt:9097" }] }' | jq
{
"success": true,
"message": "Successfully added peers ArrayBuffer(2|zw06-data-k8s-sparktest-node009.mt:9872) to group GroupInfoReply:client-3E7C9CE679B2->0group-47BEDE733167, cid=1031, SUCCESS, logIndex=0, commits[0:c224, 1:c224]."
}
➜ curl -s zw06-data-k8s-sparktest-node009.mt:9098/masterGroupInfo
====================== Master Group INFO ==============================
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 0(zw06-data-k8s-sparktest-node007.mt:9872)
[server {
id: "2"
address: "zw06-data-k8s-sparktest-node009.mt:9872"
clientAddress: "zw06-data-k8s-sparktest-node009.mt:9097"
adminAddress: "zw06-data-k8s-sparktest-node009.mt:9097"
startupRole: FOLLOWER
}
commitIndex: 228
, server {
id: "0"
address: "zw06-data-k8s-sparktest-node007.mt:9872"
clientAddress: "zw06-data-k8s-sparktest-node007.mt:9097"
adminAddress: "zw06-data-k8s-sparktest-node007.mt:9097"
startupRole: FOLLOWER
}
commitIndex: 228
, server {
id: "1"
address: "zw06-data-k8s-sparktest-node008.mt:9872"
clientAddress: "zw06-data-k8s-sparktest-node008.mt:9097"
adminAddress: "zw06-data-k8s-sparktest-node008.mt:9097"
startupRole: FOLLOWER
}
commitIndex: 228
]
```
Closes#3452 from gaoyajun02/ratis.
Authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Provide user options to enable release workers only with high workload when the number of excluded worker set is too large.
### Why are the changes needed?
In some cases, a large percentage of workers were excluded, but most of them were due to high workload. It's better to release such workers from excluded set to ensure the system availability is a priority.
### Does this PR introduce _any_ user-facing change?
New Configuration Option.
### How was this patch tested?
Unit tests.
Closes#3365 from Kalvin2077/exclude-high-stress-workers.
Lead-authored-by: yuanzhen <yuanzhen.hwk@alibaba-inc.com>
Co-authored-by: Kalvin2077 <wk.huang2077@outlook.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
as title
### Why are the changes needed?
Merge Resource.proto into TransportMessages.proto as per the below design
https://cwiki.apache.org/confluence/display/CELEBORN/CIP-16+Merge+transport+proto+and+resource+proto+files
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#3231 from zhaohehuhu/dev-0425.
Lead-authored-by: zhaohehuhu <luoyedeyi@163.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Refactoring similar code about configuration usage.
### Why are the changes needed?
Improve scalability for possible new remote storage in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#3353 from Kalvin2077/draft.
Authored-by: Kalvin2077 <wk.huang2077@outlook.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
### Why are the changes needed?
When workers with higher workloads are excluded, the master does not have a clear log.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GA
Closes#3391 from cxzl25/CELEBORN-2082.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Inspired by [FLINK-38038](https://issues.apache.org/jira/projects/FLINK/issues/FLINK-38038?filter=allissues]), I used [Tongyi Lingma](https://lingma.aliyun.com/) and qwen3-thinking LLM to identify and fix some typo issues in the Celeborn codebase. For example:
- backLog → backlog
- won`t → won't
- can to be read → can be read
- mapDataPartition → mapPartitionData
- UserDefinePasswordAuthenticationProviderImpl → UserDefinedPasswordAuthenticationProviderImpl
### Why are the changes needed?
Remove typos to improve source code readability for users and ease development for developers.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Code and documentation cleanup does not require additional testing.
Closes#3356 from codenohup/fix-typo.
Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Improve check quota message.
### Why are the changes needed?
Make check quota message clearer.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#3328 from leixm/follow_CELEBORN-1577.
Authored-by: Xianming Lei <xianming.lei@shopee.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
This PR is part of [CIP17: Interruption Aware Slot Selection](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-17%3A+Interruption+Aware+Slot+Selection).
It introduces a REST api for external services to notify master about interruptions/schedules.
### Why are the changes needed?
To nofify master of upcoming interruption notices in the worker fleet. Master can then use these to proactively deprioritize workers that might be in scope for interruption sooner.
### Does this PR introduce _any_ user-facing change?
new rest api
### How was this patch tested?
added unit tests.
Closes#3285 from akpatnam25/CELEBORN-2014.
Authored-by: Aravind Patnam <akpatnam25@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
Support min number of workers to assign slots on for a shuffle.
### Why are the changes needed?
PR https://github.com/apache/celeborn/pull/3039 updated the default value of `celeborn.master.slot.assign.extraSlots` to avoid skew in shuffle stage with less number of reducers. However, it will also affect the stage with large number of reducers, thus not ideal.
We are introducing a new config `celeborn.master.slot.assign.minWorkers` which will ensure that shuffle stages with less number of reducers will not cause load imbalance on few nodes.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
NA
Closes#3297 from s0nskar/min_workers.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Revert role name change in [CELEBORN-1627](https://github.com/apache/celeborn/pull/2777)
### Why are the changes needed?
Fix the issue where the case of name affects the metrics dashboard
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual
Closes#3299 from RexXiong/CELEBORN-1627-FOLLOWUP.
Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
SlotsAllocator should select disks randomly in RoundRobin mode
### Why are the changes needed?
The current round robin selection mechanism is to select the first disk of each worker first, then the second disk of each worker, and finally the third disk. This can easily cause disk storage space skew. We should select disks randomly instead of selecting the first disk first.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#3275 from leixm/CELEBORN-2008.
Authored-by: Xianming Lei <xianming.lei@shopee.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Improve Aliyun OSS support including `SlotsAllocator#offerSlotsLoadAware`, `Worker#heartbeatToMaster` and `PartitionDataWriter#getStorageInfo`.
### Why are the changes needed?
There are many methods where OSS support is lacking in `SlotsAllocator#offerSlotsLoadAware`, `Worker#heartbeatToMaster` and `PartitionDataWriter#getStorageInfo`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes#3268 from SteNicholas/CELEBORN-1916.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Introduce `ApplicationTotalCount` and `ApplicationFallbackCount` metric to record the total and fallback count of application.
### Why are the changes needed?
There is no any metric to record the total count of application running with celeborn shuffle and engine bulit-in shuffle and the fallback count of application. Meanwhile, the fallback of Flink shuffle is based on job granularity rather than shuffle granularity.
Follw up https://github.com/apache/celeborn/pull/3012#issuecomment-2553488532.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- `DefaultMetaSystemSuiteJ#testShuffleAndApplicationCountWithFallback`
- `RatisMasterStatusSystemSuiteJ#testShuffleAndApplicationCountWithFallback`
Closes#3026 from SteNicholas/CELEBORN-1800.
Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Introduce disruptor dependency to support asynchronous logging of log4j2.
### Why are the changes needed?
We add `-Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector` in `CELEBORN_MASTER_JAVA_OPTS` and `CELEBORN_WOKRER_JAVA_OPTS` for production environment. `AsyncLoggerContextSelector` depends on disruptor dependency. Therefore, it's recommend to introduce disruptor dependency to support log4j2 asynchronous loggers.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.
Closes#3246 from SteNicholas/CELEBORN-1994.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
After profiling to see where the hotspots are for slot selection, we identified 2 main areas:
- iter.remove ([link](https://github.com/apache/celeborn/blob/main/master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java#L447)) is a major hotspot, especially if partitionIdList is massive - since it is an ArrayList and we are removing from the begining - resulting in O(n) deletion costs.
- `haveDisk` is computed per partitionId, iterated across all workers. We precompute this and store it as a field in `WorkerInfo`.
See the below flamegraph for the hotspot of `iter.remove` (`oop_disjoint_arraycopy`) after running a benchmark.

Below is what we actually observed in production which matches with the above observation from the benchmark:

### Why are the changes needed?
speed up slot selection performance in the case of large partitionIds
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
After applying the above changes, we can see the hotspot is removed in the flamegraph:

Benchmarks:
Without changes:
```
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection
# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration 1: 2060198.745 ±(99.9%) 306976.270 us/op
# Warmup Iteration 2: 1137534.950 ±(99.9%) 72065.776 us/op
# Warmup Iteration 3: 1032434.221 ±(99.9%) 59585.256 us/op
# Warmup Iteration 4: 903621.382 ±(99.9%) 41542.172 us/op
# Warmup Iteration 5: 921816.398 ±(99.9%) 44025.884 us/op
Iteration 1: 853276.360 ±(99.9%) 13285.688 us/op
Iteration 2: 865183.111 ±(99.9%) 9691.856 us/op
Iteration 3: 909971.254 ±(99.9%) 10201.037 us/op
Iteration 4: 874154.240 ±(99.9%) 11287.538 us/op
Iteration 5: 907655.363 ±(99.9%) 11893.789 us/op
Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
882048.066 ±(99.9%) 98360.936 us/op [Average]
(min, avg, max) = (853276.360, 882048.066, 909971.254), stdev = 25544.023
CI (99.9%): [783687.130, 980409.001] (assumes normal distribution)
# Run complete. Total time: 00:05:43
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
SlotsAllocatorBenchmark.benchmarkSlotSelection avgt 5 882048.066 ± 98360.936 us/op
Process finished with exit code 0
```
With changes:
```
# Detecting actual CPU count: 12 detected
# JMH version: 1.37
# VM version: JDK 1.8.0_172, Java HotSpot(TM) 64-Bit Server VM, 25.172-b11
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home/jre/bin/java
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 5 s each
# Measurement: 5 iterations, 60 s each
# Timeout: 10 min per iteration
# Threads: 12 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection
# Run progress: 0.00% complete, ETA 00:05:25
# Fork: 1 of 1
# Warmup Iteration 1: 305437.719 ±(99.9%) 81860.733 us/op
# Warmup Iteration 2: 137498.811 ±(99.9%) 7669.102 us/op
# Warmup Iteration 3: 129355.869 ±(99.9%) 5030.972 us/op
# Warmup Iteration 4: 135311.734 ±(99.9%) 6964.080 us/op
# Warmup Iteration 5: 131013.323 ±(99.9%) 8560.232 us/op
Iteration 1: 133695.396 ±(99.9%) 3713.684 us/op
Iteration 2: 143735.961 ±(99.9%) 5858.078 us/op
Iteration 3: 135619.704 ±(99.9%) 5257.352 us/op
Iteration 4: 128806.160 ±(99.9%) 4541.790 us/op
Iteration 5: 134179.546 ±(99.9%) 5137.425 us/op
Result "org.apache.celeborn.service.deploy.master.SlotsAllocatorBenchmark.benchmarkSlotSelection":
135207.354 ±(99.9%) 20845.544 us/op [Average]
(min, avg, max) = (128806.160, 135207.354, 143735.961), stdev = 5413.522
CI (99.9%): [114361.809, 156052.898] (assumes normal distribution)
# Run complete. Total time: 00:05:29
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
SlotsAllocatorBenchmark.benchmarkSlotSelection avgt 5 135207.354 ± 20845.544 us/op
Process finished with exit code 0
```
882048.066 us/ops without changes vs 135207.354 us/op with changes. That is about 6.5x improvement.
Closes#3228 from akpatnam25/CELEBORN-1982.
Lead-authored-by: Aravind Patnam <akpatnam25@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
as title
### Why are the changes needed?
Upgrade PB version as fist step as per below design
https://cwiki.apache.org/confluence/display/CELEBORN/CIP-16+Merge+transport+proto+and+resource+proto+files
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#3201 from zhaohehuhu/dev-0403.
Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
- close [CELEBORN-1916](https://issues.apache.org/jira/browse/CELEBORN-1916)
- This PR extends the Multipart Uploader (MPU) interface to support Aliyun OSS.
### Why are the changes needed?
- Implemented multipart-uploader-oss module based on the existing MPU extension interface.
- Added necessary configurations and dependencies for Aliyun OSS integration.
- Ensured compatibility with the existing multipart-uploader framework.
- This enhancement allows seamless multipart upload functionality for Aliyun OSS, similar to the existing AWS S3 support.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Deployment integration testing has been completed in the local environment.
Closes#3157 from shouwangyw/optimize/mpu-oss.
Lead-authored-by: veli.yang <897900564@qq.com>
Co-authored-by: yangwei <897900564@qq.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Follow up for https://github.com/apache/celeborn/pull/2819
1. add timer for UpdateResourceConsumptionTime
2. prevent NPE if metrics not found
### Why are the changes needed?
The timer not added and cause NPE.
```
25/03/31 13:18:48,219 WARN [master-quota-checker] MasterSource: Metric UpdateResourceConsumptionTime{instance="zeus-slc-cm-1.zeus-slc-cm-cm.hm-dev.svc.140.tess.io:8080",role="master"} not found!
25/03/31 13:18:48,220 WARN [master-quota-checker] MasterSource: Exception encountered during stop timer of metric UpdateResourceConsumptionTime{instance="zeus-slc-cm-1.zeus-slc-cm-cm.hm-dev.svc.140.tess.io:8080",role="master"}
scala.MatchError: null
at org.apache.celeborn.common.metrics.source.AbstractSource.doStopTimer(AbstractSource.scala:316)
at org.apache.celeborn.common.metrics.source.AbstractSource.sample(AbstractSource.scala:279)
at org.apache.celeborn.service.deploy.master.quota.QuotaManager.updateResourceConsumption(QuotaManager.scala:201)
at org.apache.celeborn.service.deploy.master.quota.QuotaManager$$anon$1.run(QuotaManager.scala:59)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes#3190 from turboFei/fix_npe.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
1. Worker reports resourceConsumption to master
2. QuotaManager calculates the resourceConsumption of each app and marks the apps that exceed the quota.
2.1 When the tenant's resourceConsumption exceeds the tenant's quota, select the app with a larger consumption to mark interrupted.
2.2 When the resourceConsumption of the cluster exceeds the cluster quota, select the app with larger consumption to mark interrupted.
3. Master returns to Driver through heartbeat, whether app is marked interrupted
### Why are the changes needed?
The current storage quota logic can only limit new shuffles, and cannot limit the writing of existing shuffles. In our production environment, there is such an scenario: the cluster is small, but the user's app single shuffle is large which occupied disk resources, we want to interrupt those shuffle.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
UTs.
Closes#2819 from leixm/CELEBORN-1577-2.
Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix incorrect logic when calculate disk available slots
### Why are the changes needed?
Now we use `usableSize / estimatedPartitionSize = maxSlots`
Then `availableSlots = maxSlots - allocatedSlots`
But `availableSlots` should be `usableSize / estimizatedPartitionSize`
### Does this PR introduce _any_ user-facing change?
Yea
### How was this patch tested?
MT
Closes#3162 from AngersZhuuuu/CELEBORN-1923.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
Support pre-run static code blocks of `TransportMessages` to improve performance of protobuf serialization.
### Why are the changes needed?
The protobuf message protocol defines many map type fields, which makes it time-consuming to build these message instances. This is because `TransportMessages` contains static code blocks to initialize a large number of `Descriptor`s and `FieldAccessorTable`s, where the instantiation of `FieldAccessorTable` includes reflection. The test result proves that the static code blocks execute in about 70 milliseconds.
Therefore, it's better to pre-run static code blocks of `TransportMessages` to improve performance of protobuf serialization. Meanwhile, it's recommended to use repeated instead of map type field for rpc messages.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes#3149 from SteNicholas/CELEBORN-1909.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Support to get the workers topology information with RESTful api.
1. return networkLocation in WorkerData
2. add new api `/api/v1/workers/topology` to return the grouped workers topology info.
### Why are the changes needed?
1. To get the workers topology information.
2. To know the rack awareness well.
### Does this PR introduce _any_ user-facing change?
No break change.
### How was this patch tested?
UT and IT.
<img width="1008" alt="image" src="https://github.com/user-attachments/assets/6cb1aa2a-1160-4570-acb1-7602e2ce0b09" />
<img width="719" alt="image" src="https://github.com/user-attachments/assets/d26c3326-4837-40ad-a344-3cb4204bf607" />
Closes#3112 from turboFei/topology.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
This PR optimizes the RoundRobin algorithm to achieve a more balanced disk slot allocation across workers.
Previously, when allocating 3000 partitions using RoundRobin, the slot distribution across worker disks was [668, 666, 666], which resulted in one disk having 2 more slots than the others.
After the optimization, the slot distribution is now [667, 667, 666], ensuring a more balanced allocation.
### Why are the changes needed?
The changes are necessary to improve load balancing across worker disks, reducing the risk of overloading a single disk. This ensures a more predictable and fair distribution of slots, which can lead to better performance and resource utilization.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#3074 from gaoyajun02/1843.
Authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Using Operation description instead of ApiResponse description for RESTful APIs.
### Why are the changes needed?
Make the API description in correct place.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Before:
<img width="1434" alt="image" src="https://github.com/user-attachments/assets/7a92c4a7-d550-4221-bee2-d52918719521" />
After:
<img width="1433" alt="image" src="https://github.com/user-attachments/assets/0287b425-7b56-4ef7-ba5d-4b53b4208780" />
Closes#3038 from turboFei/api_desc.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Add two metrics (raft commitIndex of each master and maxCommitIndex - minCommitIndex value).
### Why are the changes needed?
To observe the metadata synchronization of the raft cluster.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.

Closes#3063 from zaynt4606/clb1831.
Authored-by: zhengtao <shuaizhentao.szt@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Minor fix the v1 RESTful apis before 0.6.0 release.
1. update the API description to use UPPER case worker EventType
2. `subResourceConsumption` => `subResourceConsumptions`.
### Why are the changes needed?
1. With https://github.com/apache/celeborn/pull/2754, the openapi-sdk works well. but for the RESTful call without SDK, the worker eventType is still case sensitive, might be caused by the jersey issue mentioned in https://github.com/eclipse-ee4j/jersey/issues/5288. So, In this PR, I change the description in the swagger for user guidance.
<img width="1524" alt="image" src="https://github.com/user-attachments/assets/70e4f239-dc36-47bc-902e-5340986f014a" />
2. rename `subResourceConsumption` to `subResourceConsumptions`.
### Does this PR introduce _any_ user-facing change?
No, the api has not been released.
### How was this patch tested?
GA.
Closes#3023 from turboFei/restful_minor_fix.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
To support Spark 4.0.0 preview.
### Why are the changes needed?
1. Changed Scala to 2.13.
2. Introduce columnar shuffle module for spark 4.0.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Cluster test.
Closes#2813 from FMX/b1413.
Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. rename the RPC metrics name from `${name}_${metric}` to `Rpc${metric}{name=$name}` so that it is easy to add into grafana dashboard
2. Use MASTER/WORKER/CLIENT Role for rpc env.
3. add the rpc metrics into grafana dashboard.
### Why are the changes needed?
For monitoring
### Does this PR introduce _any_ user-facing change?
No, it has not been released
### How was this patch tested?
UT for metrics source `instance`.
<img width="1456" alt="image" src="https://github.com/user-attachments/assets/90284390-54ad-49ef-a868-fa537d2301b8">
<img width="1880" alt="image" src="https://github.com/user-attachments/assets/e8101e47-d649-4c66-9978-1efb4faa047f">
Closes#2990 from turboFei/rpc_metrics.
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
1. retry on BindException when starting master/worker http server
2. record the used ports and pre-check whether the selected port is used or bounded before binding
### Why are the changes needed?
To fix flaky test.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2906 from turboFei/retry_master_suite.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
If `HDFS` is not defined in the `celeborn.storage.availableTypes`, do not gauge the HDFS metrics.
### Why are the changes needed?
To reduce the metrics number, due there is metrics capacity limitation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2965 from turboFei/user_metrics.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Due to the large size of the AWS cloud vendor's client JARs, this PR aims to keep AWS s3 module only to reduce the AWS dependency size from over 296MB to around 2.3MB
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
<img width="2560" alt="Screenshot 2024-11-25 at 16 17 52" src="https://github.com/user-attachments/assets/efebbe7d-73cb-47fb-b7fa-9aae052f744b">
tested on lab shown as above picture
Closes#2944 from zhaohehuhu/dev-1125.
Authored-by: zhaohehuhu <luoyedeyi@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
- Adding support to enable/disable worker tags feature by a master config flag.
- Fixed BUG: After this change #2936, admins can also define the tagsExpr for users. In a case user is passing an empty tagsExpr current code will ignore the admin defined tagsExpr and allow job to use all workers.
### Why are the changes needed?
https://cwiki.apache.org/confluence/display/CELEBORN/CIP-11+Supporting+Tags+in+Celeborn
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Existing UTs
Closes#2953 from s0nskar/tags-enabled.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Remove the code for app top disk usage both in master and worker end.
Prefer to use below prometheus expr to figure out the top app usages.
```
topk(50, sum by (applicationId) (metrics_diskBytesWritten_Value{role="worker", applicationId!=""}))
```
### Why are the changes needed?
To address comments: https://github.com/apache/celeborn/pull/2947#issuecomment-2499564978
> Due to the application dimension resource consumption, this feature should be included in the deprecated features. Maybe you can remove the codes for application top disk usage.
### Does this PR introduce _any_ user-facing change?
Yes, remove the app top disk usage api.
### How was this patch tested?
GA.
Closes#2949 from turboFei/remove_app_top_usage.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Support predefined tags expression for tenant and users via dynamic config. Using this admin can configure tags for users/tenants and give permission to special users to provide custom tags expression.
### Why are the changes needed?
https://cwiki.apache.org/confluence/display/CELEBORN/CIP-11+Supporting+Tags+in+Celeborn
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
UTs
Closes#2936 from s0nskar/admin_tags.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
as title
### Why are the changes needed?
AWS S3 doesn't support append, so Celeborn had to copy the historical data from s3 to worker and write to s3 again, which heavily scales out the write. This PR implements a better solution via MPU to avoid copy-and-write.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?

I conducted an experiment with a 1GB input dataset to compare the performance of Celeborn using only S3 storage versus using SSD storage. The results showed that Celeborn with SSD storage was approximately three times faster than with only S3 storage.
<img width="1728" alt="Screenshot 2024-11-16 at 13 02 10" src="https://github.com/user-attachments/assets/8f879c47-c01a-4004-9eae-1c266c1f3ef2">
The above screenshot is the second test with 5000 mapper and reducer that I did.
Closes#2830 from zhaohehuhu/dev-1021.
Lead-authored-by: zhaohehuhu <luoyedeyi@163.com>
Co-authored-by: He Zhao <luoyedeyi459@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Add latest snapshot index in message of `HandleResponse` for `/snapshot/create`.
### Why are the changes needed?
[TakeSnapshotCommand.java#68](https://github.com/apache/ratis/blob/master/ratis-shell/src/main/java/org/apache/ratis/shell/cli/sh/snapshot/TakeSnapshotCommand.java#L68) prints the latest snapshot index. Therefore, the message of `HandleResponse` for `/snapshot/create` could also provide the latest snapshot index.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes#2908 from SteNicholas/CELEBORN-1631.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
`TagsManager` uses `JavaUtils#newConcurrentHashMap` to speed up `ConcurrentHashMap#computeIfAbsent`.
Follow up #2844.
### Why are the changes needed?
Celeborn supports JDK8, which could meet the bug mentioned in [JDK-8161372](https://bugs.openjdk.org/browse/JDK-8161372). Therefore, it's better to use `JavaUtils#newConcurrentHashMap` to speed up `ConcurrentHashMap#computeIfAbsent`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes#2922 from SteNicholas/CELEBORN-1619.
Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
implement queue time/processing time metrics for rpc framework
### Why are the changes needed?
to identify rpc processing bottelneck
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
local
Closes#2784 from ErikFang/main-rpc-metrics.
Lead-authored-by: Erik.fang <fmerik@gmail.com>
Co-authored-by: 仲甫 <fangming@antgroup.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Optimize handleApplicationLost
### Why are the changes needed?
timeoutDeadApplications#ApplicationLost should be handled promptly, rather than being processed in the Master RPC queue.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Closes#2910 from xy2953396112/optimize_applost.
Authored-by: xzh <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
### What changes were proposed in this pull request?
1. cache the available workers
2. Only count the available workers device free capacity.
3. place the metrics_AvailableWorkerCount_Value in overall and metrics_WorkerCount_Value in `Master` part
### Why are the changes needed?
Cache the available workers to reduce the computation that need to loop the workers frequently.
To have an accurate device capacity overview that does not include the excluded workers, decommissioning workers, etc.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
UT.
<img width="1705" alt="image" src="https://github.com/user-attachments/assets/bee17b4e-785d-4112-8410-dbb684270ec0">
Closes#2827 from turboFei/device_free.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Integrating TagsManager with configService. Users can now update the workers tags using ConfigService.
### Why are the changes needed?
https://cwiki.apache.org/confluence/display/CELEBORN/CIP-11+Supporting+Tags+in+Celeborn
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
UTs
Closes#2844 from s0nskar/tags_via_config.
Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
### What changes were proposed in this pull request?
Fix flaky test caused by invalid port.
```
[info] ApiMasterResourceSuite:
[info] org.apache.celeborn.service.deploy.master.http.api.ApiMasterResourceSuite *** ABORTED ***
[info] java.lang.IllegalArgumentException: '65535' in celeborn.master.http.port is invalid. Invalid port
```
### Why are the changes needed?
The ports range in CelebornConf is [1024, 65535), 65535 is excluded.
169b6f6973/common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala (L2315-L2324)
<img width="928" alt="image" src="https://github.com/user-attachments/assets/4532b1bc-c548-45cd-b836-c493f2904422">
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2901 from turboFei/fix_invalid_port.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### What changes were proposed in this pull request?
1. `GET /api/v1/applications/deleteApps` -> `DELETE /api/v1/applications`
2. `GET /api/v1/applications/reviseLostShuffles` -> `POST /api/v1/applications/revise_lost_shuffles`
### Why are the changes needed?
Followup for https://github.com/apache/celeborn/pull/2746
### Does this PR introduce _any_ user-facing change?
No, these APIs has not been released yet.
### How was this patch tested?
GA.
Closes#2892 from turboFei/delete_app.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
### What changes were proposed in this pull request?
Bump protobuf from 3.21.7 to 3.25.5.
### Why are the changes needed?
To fix CVE: https://github.com/advisories/GHSA-735f-pc8j-v9w8
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA.
Closes#2898 from turboFei/bump_protobuf.
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>