Commit Graph

955 Commits

Author SHA1 Message Date
SteNicholas
7f08eb8f1d [CELEBORN-2137] Remove unused MAPGROUP PartitionType
### What changes were proposed in this pull request?

Remove unused `MAPGROUP` `PartitionType`.

### Why are the changes needed?

`PartitionType` `MAPGROUP` is unused at present, which could be removed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3459 from SteNicholas/CELEBORN-2137.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-09-03 09:58:07 +08:00
Wang, Fei
d038dd2b32 [CELEBORN-1258] Support to register application info with user identifier and extra info
### What changes were proposed in this pull request?
Support to register application info with user identifier and extra info.

### Why are the changes needed?
To provide more insight for the application information.

### Does this PR introduce _any_ user-facing change?
A new RESTful API.

### How was this patch tested?
UT.

Closes #3428 from turboFei/app_info_uid.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-09-01 11:15:40 +08:00
dz
2817f7fb9e [CELEBORN-2104] Clean up sources of NettyRpcEnv, Master and Worker to avoid thread leaks
### What changes were proposed in this pull request?

Fix NettyRpcEnv,Master,Worker `Source` to avoid thread leak

### Why are the changes needed?

1. NettyRpcEnv should clean rpcSource to prevent resource leak.
2. Master clean resourceConsumptionSource, masterSource, threadPoolSource, jVMSource, jVMCPUSource, systemMiscSource
3. Worker clean clean workerSource.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

local

Closes #3418 from xy2953396112/CELEBORN-2104.

Authored-by: dz <953396112@qq.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-29 19:04:19 +08:00
SteNicholas
1a3b9f35b5 [CELEBORN-2129] CelebornBufferStream should invoke openStreamInternal in moveToNextPartitionIfPossible to avoid client creation timeout
### What changes were proposed in this pull request?

`CelebornBufferStream` should invoke `openStreamInternal` in `moveToNextPartitionIfPossible` to avoid client creation timeout.

### Why are the changes needed?

There are many `CelebornIOException` that is cause by timeout client creation in production environment as follows:

```
2025-08-22 16:20:10,681 INFO  [flink-akka.actor.default-dispatcher-40] org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - [vertex-2]Calc(select=[lz4sql, rawsize, obcluster, ds, hh, mm, PROCTIME() AS $6]) -> Sort(orderBy=[lz4sql ASC, rawsize ASC, obcluster ASC, ds ASC, hh ASC, mm ASC, $6 DESC]) -> OverAggregate(partitionBy=[lz4sql, rawsize, obcluster, ds, hh, mm], orderBy=[$6 DESC], window#0=[ROW_NUMBER(*) AS w0$o0 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW], select=[lz4sql, rawsize, obcluster, ds, hh, mm, $6, w0$o0]) -> Calc(select=[lz4sql, rawsize, obcluster, ds, hh, mm], where=[(w0$o0 = 1)]) (668/1900) (d8bf48183d8c69a1ab84bcd445f6d4ed_0e8289f2bf927649dd2511bbc2bb6759_667_0) switched from RUNNING to FAILED on antc4flink4172792604-taskmanager-403  (dataPort=1).
java.io.IOException: org.apache.celeborn.common.exception.CelebornIOException: Connecting to /:9093 timed out (60000 ms)
	at org.apache.celeborn.common.network.client.TransportClientFactory.internalCreateClient(TransportClientFactory.java:313)
	at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:250)
	at org.apache.celeborn.common.network.client.TransportClientFactory.retryCreateClient(TransportClientFactory.java:157)
	at org.apache.celeborn.plugin.flink.network.FlinkTransportClientFactory.createClientWithRetry(FlinkTransportClientFactory.java:51)
	at org.apache.celeborn.plugin.flink.readclient.CelebornBufferStream.openStreamInternal(CelebornBufferStream.java:200)
	at org.apache.celeborn.plugin.flink.readclient.CelebornBufferStream.moveToNextPartitionIfPossible(CelebornBufferStream.java:183)
	at org.apache.celeborn.plugin.flink.RemoteBufferStreamReader.onStreamEnd(RemoteBufferStreamReader.java:161)
	at org.apache.celeborn.plugin.flink.RemoteBufferStreamReader.lambda$new$0(RemoteBufferStreamReader.java:79)
	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.processMessageInternal(ReadClientHandler.java:64)
	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.receive(ReadClientHandler.java:100)
	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.receive(ReadClientHandler.java:111)
	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.receive(ReadClientHandler.java:76)
	at org.apache.celeborn.common.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:100)
	at org.apache.celeborn.common.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:84)
	at org.apache.celeborn.common.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:156)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at org.apache.celeborn.shaded.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:289)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at org.apache.celeborn.plugin.flink.network.TransportFrameDecoderWithBufferSupplier.decodeBody(TransportFrameDecoderWithBufferSupplier.java:95)
	at org.apache.celeborn.plugin.flink.network.TransportFrameDecoderWithBufferSupplier.channelRead(TransportFrameDecoderWithBufferSupplier.java:184)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at org.apache.celeborn.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at org.apache.celeborn.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at org.apache.celeborn.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at org.apache.celeborn.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:991)

	at org.apache.celeborn.plugin.flink.RemoteBufferStreamReader.errorReceived(RemoteBufferStreamReader.java:146) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.plugin.flink.RemoteBufferStreamReader.lambda$new$0(RemoteBufferStreamReader.java:77) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.plugin.flink.readclient.CelebornBufferStream.moveToNextPartitionIfPossible(CelebornBufferStream.java:193) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.plugin.flink.RemoteBufferStreamReader.onStreamEnd(RemoteBufferStreamReader.java:161) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.plugin.flink.RemoteBufferStreamReader.lambda$new$0(RemoteBufferStreamReader.java:79) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.processMessageInternal(ReadClientHandler.java:64) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.receive(ReadClientHandler.java:100) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.receive(ReadClientHandler.java:111) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.receive(ReadClientHandler.java:76) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.common.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:100) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.common.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:84) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.common.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:156) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:289) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.plugin.flink.network.TransportFrameDecoderWithBufferSupplier.decodeBody(TransportFrameDecoderWithBufferSupplier.java:95) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.plugin.flink.network.TransportFrameDecoderWithBufferSupplier.channelRead(TransportFrameDecoderWithBufferSupplier.java:184) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at org.apache.celeborn.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[celeborn-client-flink-1.18-shaded_2.12-0.5.4-ANT.jar:?]
	at java.lang.Thread.run(Thread.java:991) ~[?:?]
```

`CelebornBufferStream` should invoke `openStreamInternal` in `moveToNextPartitionIfPossible` to avoid client creation timeout, which is caused by creating a client using the callback thread of netty.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #3450 from SteNicholas/CELEBORN-2129.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-08-27 14:21:15 +08:00
yuanzhen
8effb735f7 [CELEBORN-2066] Release workers only with high workload when the number of excluded worker set is too large
### What changes were proposed in this pull request?

Provide user options to enable release workers only with high workload when the number of excluded worker set is too large.

### Why are the changes needed?

In some cases, a large percentage of workers were excluded, but most of them were due to high workload. It's better to release such workers from excluded set to ensure the system availability is a priority.

### Does this PR introduce _any_ user-facing change?

New Configuration Option.

### How was this patch tested?
Unit tests.

Closes #3365 from Kalvin2077/exclude-high-stress-workers.

Lead-authored-by: yuanzhen <yuanzhen.hwk@alibaba-inc.com>
Co-authored-by: Kalvin2077 <wk.huang2077@outlook.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-08-22 10:14:38 +08:00
TheodoreLx
1ead784fa1 [CELEBORN-2085] Use a fixed buffer for flush copying to reduce GC
### What changes were proposed in this pull request?

Apply for a byte array in advance and use it as a transfer when copying is needed during flush

### Why are the changes needed?

For HdfsFlushTask, OssFlushTask, and S3FlushTask, you need to copy the CompositeByteBuf in the parameter to a byte array when flushing, and then use the respective clients to write the byte array to the storage.
When the flush throughput rate is very high, this copying will cause very serious GC problems and affect the performance of the worker

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

cluster test

Closes #3394 from TheodoreLx/copy-on-flush.

Authored-by: TheodoreLx <1548069580@qq.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-08-08 13:57:21 +08:00
zhaohehuhu
a498b1137f [CELEBORN-1984] Merge ResourceRequest to transportMessageProtobuf
### What changes were proposed in this pull request?
as title

### Why are the changes needed?

Merge Resource.proto into TransportMessages.proto as per the below design

https://cwiki.apache.org/confluence/display/CELEBORN/CIP-16+Merge+transport+proto+and+resource+proto+files

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #3231 from zhaohehuhu/dev-0425.

Lead-authored-by: zhaohehuhu <luoyedeyi@163.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-08-01 23:28:32 +08:00
SteNicholas
3ff44fae3f [CELEBORN-894][CELEBORN-474][FOLLOWUP] PushState uses JavaUtils#newConcurrentHashMap to speed up ConcurrentHashMap#computeIfAbsent
### What changes were proposed in this pull request?

`PushState`  uses `JavaUtils#newConcurrentHashMap` to speed up `ConcurrentHashMap#computeIfAbsent`.

### Why are the changes needed?

Celeborn supports JDK8, which could meet the bug mentioned in [JDK-8161372](https://bugs.openjdk.org/browse/JDK-8161372). Therefore, it's better to use `JavaUtils#newConcurrentHashMap` to speed up `ConcurrentHashMap#computeIfAbsent`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3396 from SteNicholas/CELEBORN-894.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-07-29 10:30:31 +08:00
Kalvin2077
c6e68fddfa [CELEBORN-2053] Refactor remote storage configration usage
### What changes were proposed in this pull request?

Refactoring similar code about configuration usage.

### Why are the changes needed?

Improve scalability for possible new remote storage in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #3353 from Kalvin2077/draft.

Authored-by: Kalvin2077 <wk.huang2077@outlook.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-07-28 16:56:32 +08:00
SteNicholas
ae40222351 [CELEBORN-2047] Support MapPartitionData on DFS
### What changes were proposed in this pull request?

Support `MapPartitionData` on DFS.

### Why are the changes needed?

`MapPartitionData` only supports on local, which does not support on DFS. It's recommended to support `MapPartitionData` on DFS for MapPartition mode to align the ability of ReducePartition mode.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`WordCountTestWithHDFS`.

Closes #3349 from SteNicholas/CELEBORN-2047.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-07-26 22:11:32 +08:00
Wang, Fei
c587f33aaf [CELEBORN-1793] Add netty pinned memory metrics
### What changes were proposed in this pull request?
Add netty pinned memory metrics

### Why are the changes needed?
We can know more accurately the memory actually allocated from PoolArena.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing uts.

Closes #3019 from leixm/CELEBORN-1793.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-07-25 17:09:42 +08:00
duanhao-jk
29ab16989d [CELEBORN-2056] Make the wait time for the client to read non shuffle partitions configurable
### What changes were proposed in this pull request?

Added a configuration for client to read non shuffle partition waiting time

### Why are the changes needed?

When the shuffle data of a task is relatively small and there are many empty shuffle partitions, it will take a lot of time for invalid waiting here

### Does this PR introduce _any_ user-facing change?

add configurable

### How was this patch tested?

production environment validation

Closes #3358 from dh20/celeborn_add-20250707.

Lead-authored-by: duanhao-jk <duanhao-jk@360shuke.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-07-24 23:20:34 -07:00
TheodoreLx
4656bcb98a [CELEBORN-2071] Fix the issue where some gauge metrics were not registered to the metricRegistry
### What changes were proposed in this pull request?

add code to register gauge to metricRegistry in addGauge method

### Why are the changes needed?

Because one implementation of addGauge does not register the gauge to the metricRegistry, some Sink implementation classes cannot export these gauge metrics. For example, you cannot see JVMCPUTime in the metric file printed by CsvSink, because CsvSink only prints metrics from the registry variable in MetricsSystem.
### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

cluster test

Closes #3369 from TheodoreLx/fix-gauge-missing.

Lead-authored-by: TheodoreLx <1548069580@qq.com>
Co-authored-by: TheodoreLx <1548069580@qq.com >
Signed-off-by: SteNicholas <programgeek@163.com>
2025-07-24 16:15:21 +08:00
SteNicholas
66856f21b3 [CELEBORN-2077] Improve toString by JEP-280 instead of ToStringBuilder
### What changes were proposed in this pull request?

Improve `toString` by JEP-280 instead of `ToStringBuilder`.

### Why are the changes needed?

Since Java 9, String Concatenation has been handled better by default.

ID | DESCRIPTION
-- | --
JEP-280 | [Indify String Concatenation](https://openjdk.org/jeps/280)

Backport https://github.com/apache/spark/pull/51572.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3380 from SteNicholas/CELEBORN-2077.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-07-22 22:50:01 -07:00
DDDominik
0ed590dc81 [CELEBORN-1917] Support celeborn.client.push.maxBytesSizeInFlight
### What changes were proposed in this pull request?
add data size limitation to inflight data by introducing a new configuration: `celeborn.client.push.maxBytesInFlight.perWorker/total` and defaults to `celeborn.client.push.buffer.max.size * celeborn.client.push.maxReqsInFlight.perWorker/total`.
for backward compatibility, also add a control: `celeborn.client.push.maxReqsInFlight.enabled`.

### Why are the changes needed?
celeborn do supports limiting the number of push inflight requests via `celeborn.client.push.maxReqsInFlight.perWorker/total`. this is a good constraint to memory usage where most requests do not exceed `celeborn.client.push.buffer.max.size`. however, in a vectorized shuffle (like blaze and gluten), a request might be greatly larger then the max buffer size, leading to too much inflight data and results OOM.

### Does this PR introduce _any_ user-facing change?
Yes, add new  config for client

### How was this patch tested?
test on local env

Closes #3248 from DDDominik/CELEBORN-1917.

Lead-authored-by: DDDominik <1015545832@qq.com>
Co-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: DDDominik <zhuangxian@kuaishou.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-07-22 23:07:56 +08:00
SteNicholas
05fca23ed2 [MINOR] Fix a typo buffer to body in ChunkFetchSuccess.toString
### What changes were proposed in this pull request?

Fix a typo `buffer` to `body` in `ChunkFetchSuccess.toString`.

### Why are the changes needed?

Since the field name of `AbstractMessage` is body, all the other places show `body=...` instead of `buffer=...`. We had better fix this typo for consistency.

Backport https://github.com/apache/spark/pull/51570.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3378 from SteNicholas/chunk-fetch-success.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-07-21 13:31:18 +08:00
SteNicholas
cf3c05d668 [CELEBORN-2068] TransportClientFactory should close channel explicitly to avoid resource leak for timeout or failure
### What changes were proposed in this pull request?

`TransportClientFactory` should close channel explicitly to avoid resource leak for timeout or failure.

### Why are the changes needed?

There is resource leak risk for timeout or failure in `TransportClientFactory#internalCreateClient`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #3368 from SteNicholas/CELEBORN-2068.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-07-18 17:50:08 +08:00
SteNicholas
6a0e19c076 [CELEBORN-2067] Clean up deprecated Guava API usage
### What changes were proposed in this pull request?

Clean up deprecated Guava API usage.

### Why are the changes needed?

There are deprecated Guava API usage, including:

1. Made modifications to Throwables.propagate with reference to https://github.com/google/guava/wiki/Why-we-deprecated-Throwables.propagate

- For cases where it is known to be a checked exception, including `IOException`, `GeneralSecurityException`, `SaslException`, and `RocksDBException`, none of which are subclasses of RuntimeException or Error, directly replaced Throwables.propagate(e) with `throw new RuntimeException(e);`.

- For cases where it cannot be determined whether it is a checked exception or an unchecked exception or Error, use

```
throwIfUnchecked(e);
throw new RuntimeException(e);
```

to replace `Throwables.propagate(e)`.

0c33dd12b1/guava/src/com/google/common/base/Throwables.java (L199-L235)

```
  /**
   * ...
   * deprecated To preserve behavior, use {code throw e} or {code throw new RuntimeException(e)}
   *     directly, or use a combination of {link #throwIfUnchecked} and {code throw new
   *     RuntimeException(e)}. But consider whether users would be better off if your API threw a
   *     different type of exception. For background on the deprecation, read <a
   *     href="https://goo.gl/Ivn2kc">Why we deprecated {code Throwables.propagate}</a>.
   */
  CanIgnoreReturnValue
  J2ktIncompatible
  GwtIncompatible
  Deprecated
  public static RuntimeException propagate(Throwable throwable) {
    throwIfUnchecked(throwable);
    throw new RuntimeException(throwable);
  }
```

Backport https://github.com/apache/spark/pull/48248

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3367 from SteNicholas/CELEBORN-2067.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-07-18 17:39:50 +08:00
Aravind Patnam
765265a87d [CELEBORN-2031] Interruption Aware Slot Selection
### What changes were proposed in this pull request?
This PR is part of [CIP17: Interruption Aware Slot Selection](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-17%3A+Interruption+Aware+Slot+Selection).

It makes the changes in the slot selection logic to prioritize workers that do not have interruption "soon". See more context about the slot selection logic [here](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=362056201#CIP17:InterruptionAwareSlotSelection-SlotsAllocator).

### Why are the changes needed?
see [CIP17: Interruption Aware Slot Selection](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-17%3A+Interruption+Aware+Slot+Selection).

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
unit tests. This is also already in production in our cluster for last 4-5 months.

Closes #3347 from akpatnam25/CELEBORN-2031-impl.

Authored-by: Aravind Patnam <akpatnam25@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-07-15 17:33:00 +08:00
codenohup
0fa600ade1 [CELEBORN-2055] Fix some typos
### What changes were proposed in this pull request?
Inspired by [FLINK-38038](https://issues.apache.org/jira/projects/FLINK/issues/FLINK-38038?filter=allissues]), I used [Tongyi Lingma](https://lingma.aliyun.com/) and qwen3-thinking LLM to identify and fix some typo issues in the Celeborn codebase. For example:
- backLog → backlog
- won`t → won't
- can to be read → can be read
- mapDataPartition → mapPartitionData
- UserDefinePasswordAuthenticationProviderImpl → UserDefinedPasswordAuthenticationProviderImpl

### Why are the changes needed?
Remove typos to improve source code readability for users and ease development for developers.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Code and documentation cleanup does not require additional testing.

Closes #3356 from codenohup/fix-typo.

Authored-by: codenohup <huangxu.walker@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-07-10 12:01:02 +08:00
Gaurav Mittal
cde33d953b [CELEBORN-894] End to End Integrity Checks
### What changes were proposed in this pull request?
Design doc - https://docs.google.com/document/d/1YqK0kua-5rMufJw57kEIrHHGbLnAF9iXM5GdDweMzzg/edit?tab=t.0#heading=h.n5ldma432qnd

- End to End integrity checks provide additional confidence that Celeborn is producing complete as well as correct data
- The checks are hidden behind a client side config that is false by default. Provides users optionality to enable these if required on a per app basis
- Only compatible with Spark at the moment
- No support for Flink (can be considered in future)
- No support for Columnar Shuffle (can be considered in future)

Writer
- Whenever a mapper completes, it reports crc32 and bytes written on a per partition basis to the driver

Driver
- Driver aggregates the mapper reports - and computes aggregated CRC32 and bytes written on per partitionID basis

Reader
- Each CelebornInputStream will report (int shuffleId, int partitionId, int startMapIndex, int endMapIndex, int crc32, long bytes) to driver when it finished reading all data on the stream
- On every report
  - Driver will aggregate the CRC32 and bytesRead for the partitionID
  - Driver will aggregate mapRange to determine when all sub-paritions have been read for partitionID have been read
  - It will then compare the aggregated CRC32 and bytes read with the expected CRC32 and bytes written for the partition
  - There is special handling for skewhandlingwithoutMapRangeSplit scenario as well
  - In this case, we report the number of sub-partitions and index of the sub-partition instead of startMapIndex and endMapIndex

There is separate handling for skew handling with and without map range split

As a follow up, I will do another PR that will harden up the checks and perform additional checks to add book keeping that every CelebornInputStream makes the required checks

### Why are the changes needed?
https://issues.apache.org/jira/browse/CELEBORN-894

Note: I am putting up this PR even though some tests are failing, since I want to get some early feedback on the code changes.

### Does this PR introduce _any_ user-facing change?
Not sure how to answer this. A new client side config is available to enable the checks if required

### How was this patch tested?
Unit tests + Integration tests

Closes #3261 from gauravkm/gaurav/e2e_checks_v3.

Lead-authored-by: Gaurav Mittal <gaurav@stripe.com>
Co-authored-by: Gaurav Mittal <gauravkm@gmail.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-06-28 09:19:57 +08:00
mingji
7a0eee332a [CELEBORN-2045] Add logger sinks to allow persist metrics data and avoid possible worker OOM
### What changes were proposed in this pull request?
1. Add a new sink and allow the user to store metrics to files.
2. Celeborn will scrape its metrics periodically to make sure that the metric data won't be too large to cause OOM.

### Why are the changes needed?
A long-running worker ran out of memory and found out that the metrics are huge in the heap dump.
As you can see below, the biggest object is the time metric queue, and I got 1.6 million records.
<img width="1516" alt="Screenshot 2025-06-24 at 09 59 30" src="https://github.com/user-attachments/assets/691c7bc2-b974-4cc0-8d5a-bf626ab903c0" />
<img width="1239" alt="Screenshot 2025-06-24 at 14 45 10" src="https://github.com/user-attachments/assets/ebdf5a4d-c941-4f1e-911f-647aa156b37a" />

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

Closes #3346 from FMX/b2045.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <ethanfeng@apache.org>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-26 18:42:20 -07:00
Jray
0fc7827ab8 [CELEBORN-2036] Fix NPE when TransportMessage has null payload
### What changes were proposed in this pull request?
As title.

### Why are the changes needed?
An NPE will be thrown when the ```TransportMessage``` payload is null, and there is no check here.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing ut.

Closes #3330 from Jraaay/main.

Authored-by: Jray <1075860716@qq.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-06-26 10:43:12 +08:00
caohaotian
4d4012e4c3 [CELEBORN-2040] Avoid throw FetchFailedException when GetReducerFileGroupResponse failed via broadcast
### What changes were proposed in this pull request?

In our production environment, when obtaining GetReducerFileGroupResponse via broadcast[CELEBORN-1921], failures may occur due to reasons such as:Executor preemption or local disk errors when task writing broadcast data. These scenarios throw a CelebornIOException, which is eventually converted to a FetchFailedException.

However, I think these errors are not caused by shuffle-related metadata loss, so a FetchFailedException should not be thrown to trigger a stage retry. Instead, the task should simply fail and be retried at the task level.

### Why are the changes needed?

To reduce false positive fetch failures.

case 1: During deserialization of the GetReducerFileGroupResponse broadcast, ExecutorLostFailure happend because Container was preempted,  leads to reporting a fetch failure.
```
25/06/16 08:39:21 INFO Executor task launch worker for task 30724 SparkUtils: Deserializing GetReducerFileGroupResponse broadcast for shuffle: 0
25/06/16 08:39:21 INFO Executor task launch worker for task 30724 TorrentBroadcast: Started reading broadcast variable 7 with 3 pieces (estimated total size 12.0 MiB)
......

25/06/16 08:39:21 ERROR Executor task launch worker for task 30724 SparkUtils: Failed to deserialize GetReducerFileGroupResponse for shuffle: 0
java.io.IOException: org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1387)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:226)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.shuffle.celeborn.SparkUtils.lambda$deserializeGetReducerFileGroupResponse$4(SparkUtils.java:600)
	at org.apache.celeborn.common.util.KeyLock.withLock(KeyLock.scala:65)
	at org.apache.spark.shuffle.celeborn.SparkUtils.deserializeGetReducerFileGroupResponse(SparkUtils.java:585)
	at org.apache.spark.shuffle.celeborn.CelebornShuffleReader$$anon$5.apply(CelebornShuffleReader.scala:485)
	at org.apache.spark.shuffle.celeborn.CelebornShuffleReader$$anon$5.apply(CelebornShuffleReader.scala:480)
	at org.apache.celeborn.client.ShuffleClient.deserializeReducerFileGroupResponse(ShuffleClient.java:321)
	at org.apache.celeborn.client.ShuffleClientImpl.loadFileGroupInternal(ShuffleClientImpl.java:1876)
	at org.apache.celeborn.client.ShuffleClientImpl.lambda$updateFileGroup$9(ShuffleClientImpl.java:1935)
	at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
	at org.apache.celeborn.client.ShuffleClientImpl.updateFileGroup(ShuffleClientImpl.java:1931)
	at org.apache.spark.shuffle.celeborn.CelebornShuffleReader.read(CelebornShuffleReader.scala:119)
	at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:225)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
	at org.apache.spark.scheduler.Task.run(Task.scala:130)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:477)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1428)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:480)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:105)
	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:89)
	at org.apache.spark.storage.BlockManagerMaster.getLocationsAndStatus(BlockManagerMaster.scala:93)
	at org.apache.spark.storage.BlockManager.getRemoteBlock(BlockManager.scala:1179)
	at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:1341)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBlocks$1(TorrentBroadcast.scala:180)
	at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.broadcast.TorrentBroadcast.readBlocks(TorrentBroadcast.scala:169)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:253)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:231)
	at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:226)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1380)
	... 40 more
Caused by: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask739bc42 rejected from java.util.concurrent.ScheduledThreadPoolExecutor66c2c5b0[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326)
	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
	at org.apache.spark.rpc.netty.NettyRpcEnv.askAbortable(NettyRpcEnv.scala:264)
	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.askAbortable(NettyRpcEnv.scala:552)
	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:556)
	at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:104)
	... 54 more

25/06/16 08:39:21 ERROR Executor task launch worker for task 30723 ShuffleClientImpl: Exception raised while call GetReducerFileGroup for 0.
org.apache.celeborn.common.exception.CelebornIOException: Failed to get GetReducerFileGroupResponse broadcast for shuffle: 0
	at org.apache.celeborn.client.ShuffleClientImpl.loadFileGroupInternal(ShuffleClientImpl.java:1878)
......

25/06/16 08:39:21 WARN Executor task launch worker for task 30724 CelebornShuffleReader: Handle fetch exceptions for 0-0
org.apache.celeborn.common.exception.CelebornIOException: Failed to load file group of shuffle 0 partition 4643! Failed to get GetReducerFileGroupResponse broadcast for shuffle: 0
	at org.apache.celeborn.client.ShuffleClientImpl.updateFileGroup(ShuffleClientImpl.java:1943)
	at org.apache.spark.shuffle.celeborn.CelebornShuffleReader.read(CelebornShuffleReader.scala:119)
......
Caused by: org.apache.celeborn.common.exception.CelebornIOException: Failed to get GetReducerFileGroupResponse broadcast for shuffle: 0
	at org.apache.celeborn.client.ShuffleClientImpl.loadFileGroupInternal(ShuffleClientImpl.java:1878)
	at org.apache.celeborn.client.ShuffleClientImpl.lambda$updateFileGroup$9(ShuffleClientImpl.java:1935)
	at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
	at org.apache.celeborn.client.ShuffleClientImpl.updateFileGroup(ShuffleClientImpl.java:1931)
	... 27 more
```
case 2: During deserialization of the GetReducerFileGroupResponse broadcast, a failure to create the local directory leads to reporting a fetch failure.
```
25/05/27 07:27:03 INFO Executor task launch worker for task 20399 SparkUtils: Deserializing GetReducerFileGroupResponse broadcast for shuffle: 1
25/05/27 07:27:03 INFO Executor task launch worker for task 20399 TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
25/05/27 07:27:03 INFO Executor task launch worker for task 20399 TorrentBroadcast: Reading broadcast variable 5 took 0 ms
25/05/27 07:27:03 INFO Executor task launch worker for task 20399 MemoryStore: Block broadcast_5 stored as values in memory (estimated size 980.4 KiB, free 6.3 GiB)
25/05/27 07:27:03 WARN Executor task launch worker for task 20399 BlockManager: Putting block broadcast_5 failed due to exception java.io.IOException: Failed to create local dir in /data12/hadoop/yarn/nm-local-dir/usercache/......
25/05/27 07:27:03 WARN Executor task launch worker for task 20399 BlockManager: Block broadcast_5 was not removed normally.
25/05/27 07:27:03 ERROR Executor task launch worker for task 20399 Utils: Exception encountered
java.io.IOException: Failed to create local dir in /data12/hadoop/yarn/nm-local-dir/usercache/......
	at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:93)
	at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:114)
	at org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:2050)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1574)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1611)
	at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:1467)
	at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1936)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:262)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:231)
	at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
	at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:226)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1380)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:226)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.shuffle.celeborn.SparkUtils.lambda$deserializeGetReducerFileGroupResponse$4(SparkUtils.java:600)
	at org.apache.celeborn.common.util.KeyLock.withLock(KeyLock.scala:65)
	at org.apache.spark.shuffle.celeborn.SparkUtils.deserializeGetReducerFileGroupResponse(SparkUtils.java:585)
	at org.apache.spark.shuffle.celeborn.CelebornShuffleReader$$anon$5.apply(CelebornShuffleReader.scala:485)
	at org.apache.spark.shuffle.celeborn.CelebornShuffleReader$$anon$5.apply(CelebornShuffleReader.scala:480)
	at org.apache.celeborn.client.ShuffleClient.deserializeReducerFileGroupResponse(ShuffleClient.java:321)
	at org.apache.celeborn.client.ShuffleClientImpl.loadFileGroupInternal(ShuffleClientImpl.java:1876)
	at org.apache.celeborn.client.ShuffleClientImpl.lambda$updateFileGroup$9(ShuffleClientImpl.java:1935)
	at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1877)
	at org.apache.celeborn.client.ShuffleClientImpl.updateFileGroup(ShuffleClientImpl.java:1931)
	at org.apache.spark.shuffle.celeborn.CelebornShuffleReader.read(CelebornShuffleReader.scala:119)
	at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:225)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:60)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.scheduler.Task.run(Task.scala:130)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:477)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1428)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:480)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

25/05/27 07:27:03 ERROR Executor task launch worker for task 20399 ShuffleClientImpl: Exception raised while call GetReducerFileGroup for 1.
org.apache.celeborn.common.exception.CelebornIOException: Failed to get GetReducerFileGroupResponse broadcast for shuffle: 1
......

25/05/27 07:27:03 WARN Executor task launch worker for task 20399 CelebornShuffleReader: Handle fetch exceptions for 1-0
org.apache.celeborn.common.exception.CelebornIOException: Failed to load file group of shuffle 1 partition 4001! Failed to get GetReducerFileGroupResponse broadcast for shuffle: 1
	at org.apache.celeborn.client.ShuffleClientImpl.updateFileGroup(ShuffleClientImpl.java:1943)
	at org.apache.spark.shuffle.celeborn.CelebornShuffleReader.read(CelebornShuffleReader.scala:119)
......
Caused by: org.apache.celeborn.common.exception.CelebornIOException: Failed to get GetReducerFileGroupResponse broadcast for shuffle: 1
	at org.apache.celeborn.client.ShuffleClientImpl.loadFileGroupInternal(ShuffleClientImpl.java:1878)
......

```

### Does this PR introduce _any_ user-facing change?
When `ShuffleClient.deserializeReducerFileGroupResponse(shuffleId, response.broadcast())` return null, will not report a fetch failure, instead, the task will simply fail.

### How was this patch tested?
Long-running Production Validation

Closes #3341 from vastian180/CELEBORN-2040.

Lead-authored-by: caohaotian <caohaotian@meituan.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-22 23:59:29 -07:00
Wang, Fei
3d614f848e [CELEBORN-1931][FOLLOWUP] Update config version for worker local flusher gather api
### What changes were proposed in this pull request?

Update the config version to 0.5.5 for `celeborn.worker.flusher.local.gatherAPI.enabled`.

### Why are the changes needed?
Followup https://github.com/apache/celeborn/pull/3335

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

GA

Closes #3338 from turboFei/CELEBORN-1931_follow.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-17 11:54:26 -07:00
Sanskar Modi
919ece8ad2 [CELEBORN-2015][FOLLOWUP] Retry IOException failures for RPC requests
### What changes were proposed in this pull request?
Follow up PR for https://github.com/apache/celeborn/pull/3286 – Handling IOException wrapped inside CelebornException.

### Why are the changes needed?

`org.apache.celeborn.common.util.ThreadUtils#awaitResult` wraps non-timeout exception into CelebornException because of which it is not getting caught and retries are not working.

Ex – 
```
org.apache.celeborn.common.exception.CelebornRuntimeException: setupLifecycleManagerRef failed!
	at org.apache.celeborn.client.ShuffleClientImpl.setupLifecycleManagerRef(ShuffleClientImpl.java:1834)
	at org.apache.celeborn.client.ShuffleClient.get(ShuffleClient.java:89)
	at org.apache.spark.shuffle.celeborn.SparkShuffleManager.getWriter(SparkShuffleManager.java:241)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:57)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:138)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:556)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:559)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.celeborn.common.exception.CelebornException: Exception thrown in awaitResult:
	at org.apache.celeborn.common.util.ThreadUtils$.awaitResult(ThreadUtils.scala:320)
	at org.apache.celeborn.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:78)
	at org.apache.celeborn.common.rpc.RpcEnv.setupEndpointRefByAddr(RpcEnv.scala:111)
	at org.apache.celeborn.common.rpc.RpcEnv.$anonfun$setupEndpointRef$1(RpcEnv.scala:133)
	at org.apache.celeborn.common.util.Utils$.withRetryOnTimeoutOrIOException(Utils.scala:1306)
	at org.apache.celeborn.common.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:133)
	at org.apache.celeborn.client.ShuffleClientImpl.setupLifecycleManagerRef(ShuffleClientImpl.java:1828)
	... 12 more
Caused by: java.net.SocketException: Connection reset
	at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
	at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426)
	at org.apache.celeborn.shaded.io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:255)
	at org.apache.celeborn.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
	at org.apache.celeborn.shaded.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:357)
	at org.apache.celeborn.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at org.apache.celeborn.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
NA

Closes #3315 from s0nskar/ioexception.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-09 11:53:48 -07:00
lvshuang.xjs
5e305c3a5a [CELEBORN-1673][FOLLOWUP] Shouldn't ignore InterruptedException when client retry
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
Speculatively tasks will be interrupted if another task succeeds. In this case, interrupting the speculative execution can lead to client retries, and ignoring the InterruptedException might prevent the task from being killed promptly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
PASS GA

Closes #3314 from RexXiong/CELEBORN-1673-FOLLOWUP.

Authored-by: lvshuang.xjs <lvshuang.xjs@taobao.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-06 10:36:05 -07:00
Aravind Patnam
ebfa1d8cf4 [CELEBORN-2014] updateInterruptionNotice REST API
### What changes were proposed in this pull request?
This PR is part of [CIP17: Interruption Aware Slot Selection](https://cwiki.apache.org/confluence/display/CELEBORN/CIP-17%3A+Interruption+Aware+Slot+Selection).
It introduces a REST api for external services to notify master about interruptions/schedules.

### Why are the changes needed?
To nofify master of upcoming interruption notices in the worker fleet. Master can then use these to proactively deprioritize workers that might be in scope for interruption sooner.

### Does this PR introduce _any_ user-facing change?
new rest api

### How was this patch tested?
added unit tests.

Closes #3285 from akpatnam25/CELEBORN-2014.

Authored-by: Aravind Patnam <akpatnam25@gmail.com>
Signed-off-by: SteNicholas <programgeek@163.com>
2025-06-06 14:07:49 +08:00
nicolas.fraison@datadoghq.com
061cdc3820 [CELEBORN-2003] Add retry mechanism when completing S3 multipart upload
### What changes were proposed in this pull request?

Add a retry mechanism when completing S3 multipart upload to ensure that completeMultipartUpload is retry when facing retryable exception like SlowDown one

### Why are the changes needed?

While running a “simple” spark jobs creating 10TiB of shuffle data (repartition from 100k partition to 20) the job was constantly failing when all files should be committed. relying on SOFT `celeborn.client.shuffle.partitionSplit.mode`

Despite an increase of `celeborn.storage.s3.mpu.maxRetries` up to `200`. Job was still failing due to SlowDown exception
Adding some debug logs on the retry policy from AWS S3 SDK I've seen that the policy is never called when doing completeMultipartUpload action while it is well called on other actions. See https://issues.apache.org/jira/browse/CELEBORN-2003

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Created a cluster on a kubernetes server relying on S3 storage.
Launch a 10TiB shuffle from 100000 partitions to 200 partitions with SOFT `celeborn.client.shuffle.partitionSplit.mode`
The job succeed and well display some warn logs indicating that the `completeMultipartUpload` is retried due to SlowDown:
```
bucket ******* key poc/spark-2c86663c948243d19c127e90f704a3d5/0/35-39-0 uploadId Pbaq.pp1qyLvtGbfZrMwA8RgLJ4QYanAMhmv0DvKUk0m6.GlCKdC3ICGngn7Q7iIa0Dw1h3wEn78EoogMlYgFD6.tDqiatOTbFprsNkk0qzLu9KY8YCC48pqaINcvgi8c1gQKKhsf1zZ.5Et5j40wQ-- upload failed to complete, will retry (1/10)
com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: null; Status Code: 0; Error Code: SlowDown; Request ID: RAV5MXX3B9Z3ZHTG; S3 Extended Request ID: 9Qqm3vfJVLFNY1Y3yKAobJHv7JkHQP2+v8hGSW2HYIOputAtiPdkqkY5MfD66lEzAl45m71aiPVB0f1TxTUD+upUo0NxXp6S; Proxy: null), S3 Extended Request ID: 9Qqm3vfJVLFNY1Y3yKAobJHv7JkHQP2+v8hGSW2HYIOputAtiPdkqkY5MfD66lEzAl45m71aiPVB0f1TxTUD+upUo0NxXp6S 	at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler.doEndElement(XmlResponsesSaxParser.java:1906)
```

Closes #3293 from ashangit/nfraison/CELEBORN-2003.

Authored-by: nicolas.fraison@datadoghq.com <nicolas.fraison@datadoghq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-06-06 10:15:26 +08:00
sychen
211046d0ec [CELEBORN-2025] RpcFailure Scala 2.13 serialization is incompatible
### What changes were proposed in this pull request?

### Why are the changes needed?
Spark4 uses scala 2.13, but it cannot be connected to the celeborn master compiled with scala 2.12.

For example, the first node of `celeborn.master.endpoints` configured by client is not a leader, and RpcFailure will be returned at this time.

```
https://github.com/scala/bug/issues/11207

https://users.scala-lang.org/t/serialversionuid-change-between-scala-2-12-6-and-2-12-7/3478
```

```java
java.io.InvalidClassException: org.apache.celeborn.common.rpc.netty.RpcFailure; local class incompatible: stream classdesc serialVersionUID = 2793139166962436434, local class serialVersionUID = -1724324816907181707
	at java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:598) ~[?:?]
```

```bash
scala  -cp /tmp/celeborn-client-spark-3-shaded_2.12-0.5.4.jar
```

```scala

:paste -raw

  package org.apache.celeborn {
    class Y {
        def printId = {
            val clazz = classOf[org.apache.celeborn.common.rpc.netty.RpcFailure]
            val uid = java.io.ObjectStreamClass.lookup(clazz).getSerialVersionUID
            println(s"Scala version: ${scala.util.Properties.versionNumberString}")
            println(s"serialVersionUID: $uid")
      }
    }
  }

new org.apache.celeborn.Y().printId
```

2.11
```
Scala version: 2.11.12
serialVersionUID: 2793139166962436434
```
2.12
```
Scala version: 2.12.19
serialVersionUID: 2793139166962436434
```
2.13
```
Scala version: 2.13.16
serialVersionUID: -1724324816907181707
```

### Does this PR introduce _any_ user-facing change?
If we used the master compiled with 2.13 before, it may be incompatible.

### How was this patch tested?
local test

```
Scala version: 2.13.16
serialVersionUID: 2793139166962436434
```

Closes #3309 from cxzl25/CELEBORN-2025.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-06-05 22:05:56 +08:00
Sanskar Modi
aceee64c73 [CELEBORN-2018] Support min number of workers selected for shuffle
### What changes were proposed in this pull request?
Support min number of workers to assign slots on for a shuffle.

### Why are the changes needed?

PR https://github.com/apache/celeborn/pull/3039 updated the default value of `celeborn.master.slot.assign.extraSlots` to avoid skew in shuffle stage with less number of reducers. However, it will also affect the stage with large number of reducers, thus not ideal.

We are introducing a new config `celeborn.master.slot.assign.minWorkers` which will ensure that shuffle stages with less number of reducers will not cause load imbalance on few nodes.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
NA

Closes #3297 from s0nskar/min_workers.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-01 08:23:53 -07:00
SteNicholas
68a1db1e3b [CELEBORN-2005][FOLLOWUP] Introduce ShuffleMetricGroup for numBytesIn, numBytesOut, numRecordsOut, numBytesInPerSecond, numBytesOutPerSecond, numRecordsOutPerSecond metrics
### What changes were proposed in this pull request?

Introduce `ShuffleMetricGroup` for `numBytesIn`, `numBytesOut`, `numRecordsOut`, `numBytesInPerSecond`, `numBytesOutPerSecond`, `numRecordsOutPerSecond` metrics.

Follow up #3272.

### Why are the changes needed?

`numBytesIn`, `numBytesOut`, `numRecordsOut`, `numBytesInPerSecond`, `numBytesOutPerSecond`, `numRecordsOutPerSecond` metrics should put shuffle id into variables, which could introduce `ShuffleMetricGroup` to support. Meanwhile, #3272 would print many same logs as follows that shoud be improved:

```
2025-05-28 10:48:54,433 WARN  [flink-akka.actor.default-dispatcher-18] org.apache.flink.metrics.MetricGroup                         [] - Name collision: Group already contains a Metric with the name 'numRecordsOut'. Metric will not be reported.[11.66.62.202, taskmanager, antc4flink3980005426-taskmanager-3-70, antc4flink3980005426, [vertex-2]HashJoin(joinType=[LeftOuterJoin], where=[(f0 = f00)], select=[f0, f1, f2, f3, f4, f5, f6, f00, f10, f20, f30, f40, f50, f60], build=[right]) -> Sink: Sink(table=[default_catalog.default_database.sink], fields=[f0, f1, f2, f3, f4, f5, f6, f00, f10, f20, f30, f40, f50, f60]), 2, Shuffle, Remote, 1]
```

### Does this PR introduce _any_ user-facing change?

Introduce `celeborn.client.flink.metrics.scope.shuffle` config option to define the scope format string that is applied to all metrics scoped to a shuffle:

- Variables:
   - Shuffle: `<task_id>, <task_name>, <task_attempt_id>, <task_attempt_num>, <subtask_index>, <shuffle_id>`.
- Metrics:

Scope | Metrics | Description | Type
-- | -- | -- | --
Shuffle | numBytesIn | The total number of bytes this shuffle has read. | Counter |
Shuffle | numBytesOut| The total number of bytes this shuffle has written. | Counter |
Shuffle | numRecordsOut | The total number of records this shuffle has written. | Counter |
Shuffle | numBytesInPerSecond | The number of bytes this shuffle reads per second. | Meter |
Shuffle | numBytesOutPerSecond | The number of bytes this shuffle writes per second. | Meter |
Shuffle | numRecordsOutPerSecond | The number of records this shuffle writes per second. | Meter |

### How was this patch tested?

Manual test.

![image](https://github.com/user-attachments/assets/a10c18ab-84f9-44f5-bb2d-e6b08e5bc64e)
![image](https://github.com/user-attachments/assets/0cb29c17-3388-4608-b7a4-ee7e3c9b43c1)

Closes #3296 from SteNicholas/CELEBORN-2005.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Weijie Guo <reswqa@163.com>
2025-05-30 14:54:28 +08:00
Shuang
0227a1ab29 [CELEBORN-1627][FOLLOWUP] Fix the issue where the case of name affects the metrics dashboard
### What changes were proposed in this pull request?
Revert role name change in [CELEBORN-1627](https://github.com/apache/celeborn/pull/2777)

### Why are the changes needed?
Fix the issue where the case of name affects the metrics dashboard

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual

Closes #3299 from RexXiong/CELEBORN-1627-FOLLOWUP.

Authored-by: Shuang <lvshuang.xjs@alibaba-inc.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-29 22:40:13 -07:00
Sanskar Modi
aeac31f6f0 [CELEBORN-2009] Commit files request failure should exclude worker in LifecycleManager
### What changes were proposed in this pull request?

Exclude worker in lifecycle manager if the commit files request on workers fails with `COMMIT_FILE_EXCEPTION` or after multiple retries.

### Why are the changes needed?

If worker is under high load and not able to process request because of high CPU, we should exclude it so it will not affect the next retry to shuffle stage.

Internally, we are seeing commit file futures in worker under high load are getting timed out and next retry of the stage is again picking same servers and failing. Similarly, we are seeing continuous RpcTimeout for workers but those workers are again getting selected for next retry.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

NA

Closes #3276 from s0nskar/worker_exlude_on_commit_exception.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-27 20:16:51 -07:00
Sanskar Modi
612464c69d [CELEBORN-2015] Retry IOException failures for RPC requests
### What changes were proposed in this pull request?
- Support retry on IOException failures for RpcRequest in addition with RpcTimeoutException.
- Moved duplicate code to Utils

### Why are the changes needed?

Currently if a request fails with SocketException or IOException it does not get retried which leads to stage failures. Celeborn should retry on such connection failures.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

Closes #3286 from s0nskar/setup_lifecycle_exception.

Authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-27 07:37:40 -07:00
sychen
634343e200 [CELEBORN-2007] Reduce PartitionLocation memory usage
### What changes were proposed in this pull request?

### Why are the changes needed?
Driver may have a large number of `PartitionLocation` objects, reducing some unnecessary fields of `PartitionLocation` can reduce the memory pressure of Driver.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #3274 from cxzl25/CELEBORN-2007.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-22 17:20:58 -07:00
Jinqian Fan
f7be341948 [CELEBORN-1902] Read client throws PartitionConnectionException
### What changes were proposed in this pull request?
`org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException` is thrown when RemoteBufferStreamReader finds that the current exception is about connection failure.

### Why are the changes needed?

If `org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException` is correctly thrown to reflect connection failure, then Flink can be aware of the lost Celeborn server side nodes and be able to re-compute affected data. Otherwise, endless retries could cause Flink job failure.

This PR is to deal with exceptions like:
```
java.io.IOException: org.apache.celeborn.common.exception.CelebornIOException: Failed to connect to ltx1-app10154.prod.linkedin.com/10.88.105.20:23924
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Tested in a Flink batch job with Celeborn.

Closes #3147 from Austinfjq/throw-Partition-Connection-Exception.

Lead-authored-by: Jinqian Fan <jinqianfan@icloud.com>
Co-authored-by: Austin Fan <jinqianfan@icloud.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-21 16:58:30 -07:00
Wang, Fei
2a847ba90e [MINOR] Change some config version
### What changes were proposed in this pull request?
Fix the false config version in https://celeborn.apache.org/docs/0.5.4/configuration/

In https://github.com/apache/celeborn/pull/3082, it fixed:
- celeborn.master.endpoints.resolver
- celeborn.client.chunk.prefetch.enabled
- celeborn.client.inputStream.creation.window

In this PR, it fixes the remaining
-  celeborn.ssl.<module>.sslHandshakeTimeoutMs

### Why are the changes needed?
Fix the false config version in https://celeborn.apache.org/docs/0.5.4/configuration/

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
GA

Closes #3269 from turboFei/config_version.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-21 16:39:02 -07:00
SteNicholas
46d9d63e1f [CELEBORN-1916][FOLLOWUP] Improve Aliyun OSS support
### What changes were proposed in this pull request?

Improve Aliyun OSS support including `SlotsAllocator#offerSlotsLoadAware`, `Worker#heartbeatToMaster` and `PartitionDataWriter#getStorageInfo`.

### Why are the changes needed?

There are many methods where OSS support is lacking in `SlotsAllocator#offerSlotsLoadAware`, `Worker#heartbeatToMaster` and `PartitionDataWriter#getStorageInfo`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3268 from SteNicholas/CELEBORN-1916.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-21 11:44:50 +08:00
CodingCat
0b5a09a9f7 [CELEBORN-1896] delete data from failed to fetch shuffles
### What changes were proposed in this pull request?

it's a joint work with YutingWang98

currently we have to wait for spark shuffle object gc to clean disk space occupied by celeborn shuffles

As a result, if a shuffle is failed to fetch and retried , the disk space occupied by the failed attempt cannot really be cleaned , we hit this issue internally when we have to deal with 100s of TB level shuffles in a single spark application, any hiccup in servers can double even triple the disk usage

this PR implements the mechanism to delete files from failed-to-fetch shuffles

the main idea is actually simple, it triggers clean up in LifecycleManager when it applies for a new celeborn shuffle id for a retried shuffle write stage

the tricky part is that is to avoid delete shuffle files when it is referred by multiple downstream stages: the PR introduces RunningStageManager to track the dependency between stages

### Why are the changes needed?

saving disk space

### Does this PR introduce _any_ user-facing change?

a new config

### How was this patch tested?

we manually delete some files

![image](https://github.com/user-attachments/assets/4136cd52-78b2-44e7-8244-db3c5bf9d9c4)

from the above screenshot we can see that originally we have shuffle 0, 1 and after 1 faced with chunk fetch failure, it triggers a retry of 0 (shuffle 2), but at this moment, 0 has been deleted from the workers

![image](https://github.com/user-attachments/assets/7d3b4d90-ae5a-4a54-8dec-a5005850ef0a)

in the logs, we can see that in the middle the application , the unregister shuffle request was sent for shuffle 0

Closes #3109 from CodingCat/delete_fi.

Lead-authored-by: CodingCat <zhunansjtu@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-21 11:23:11 +08:00
SteNicholas
fd715b41af [CELEBORN-1993] CelebornConf introduces celeborn.<module>.io.threads to specify number of threads used in the client thread pool
### What changes were proposed in this pull request?

`CelebornConf` introduces `celeborn.<module>.io.threads` to specify number of threads used in the client thread pool.

### Why are the changes needed?

`ShuffleClientImpl` and `FlinkShuffleClientImpl` use fixed configuration expression as `conf.getInt("celeborn." + module + ".io.threads", 8)`. Therefore, `CelebornConf` should introduce `celeborn.<module>.io.threads` to specify number of threads used in the client thread pool.

### Does this PR introduce _any_ user-facing change?

`CelebornConf` adds `celeborn.<module>.io.threads` config option.

### How was this patch tested?

No.

Closes #3245 from SteNicholas/CELEBORN-1993.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-20 17:44:38 +08:00
Wang, Fei
ec62d924c5 [CELEBORN-2000] Ignore the getReducerFileGroup timeout before shuffle stage end
### What changes were proposed in this pull request?

Ignore the getReducerFileGroup timeout before shuffle stage end.
### Why are the changes needed?

1. if the getReducerFileGroup timeout is caused by lifecycle manager commitFiles timeout(stage not ended)
2. maybe many tasks failed and would not report fetch failure
3. then it cause the spark application failed eventually.

The shuffle client should ignore the getReducerFileGroup timeout before LifeCycleManager commitFiles complete.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

UT

Closes #3263 from turboFei/is_stage_end.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-20 14:16:46 +08:00
SteNicholas
d9984c9e0e [CELEBORN-1800] Introduce ApplicationTotalCount and ApplicationFallbackCount metric to record the total and fallback count of application
### What changes were proposed in this pull request?

Introduce `ApplicationTotalCount` and `ApplicationFallbackCount` metric to record the total and fallback count of application.

### Why are the changes needed?

There is no any metric to record the total count of application running with celeborn shuffle and engine bulit-in shuffle and the fallback count of application. Meanwhile, the fallback of Flink shuffle is based on job granularity rather than shuffle granularity.

Follw up https://github.com/apache/celeborn/pull/3012#issuecomment-2553488532.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `DefaultMetaSystemSuiteJ#testShuffleAndApplicationCountWithFallback`
- `RatisMasterStatusSystemSuiteJ#testShuffleAndApplicationCountWithFallback`

Closes #3026 from SteNicholas/CELEBORN-1800.

Lead-authored-by: SteNicholas <programgeek@163.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-19 07:20:00 -07:00
mingji
4205f83da3 [CELEBORN-1995] Optimize memory usage for push failed batches
### What changes were proposed in this pull request?
Aggregate push failed batch for the same map ID and attempt ID.

### Why are the changes needed?
To reduce memory usage.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
GA and cluster run.

Closes #3253 from FMX/b1995.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-18 07:19:26 -07:00
Xianming Lei
d03efcbdb3 [CELEBORN-1999] OpenStreamTime should use requestId to record cost time
### What changes were proposed in this pull request?
OpenStreamTime should use requestId to record cost time instead of shuffleKey

### Why are the changes needed?
OpenStreamTime is wrong because there will be multiple OpenStream requests for the same shuffleKey in the same time period.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #3258 from leixm/CELEBORN-1999.

Authored-by: Xianming Lei <xianming.lei@shopee.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-15 01:52:14 -07:00
lijianfu03
045411ac34 [CELEBORN-1855] LifecycleManager return appshuffleId for non barrier stage when fetch fail has been reported
### What changes were proposed in this pull request?
for non barrier shuffle read stage, LifecycleManager#handleGetShuffleIdForApp always return appshuffleId whether fetch status is true or not.

### Why are the changes needed?

As described in [jira](https://issues.apache.org/jira/browse/CELEBORN-1855), If LifecycleManager only returns appshuffleId whose fetch status is success, the task will fail directly to "there is no finished map stage associated with", but previous fetch fail event reported may not be fatal.So just give it a chance

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #3090 from buska88/celeborn-1855.

Authored-by: lijianfu03 <lijianfu@meituan.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-05-13 16:14:03 +08:00
Sanskar Modi
a547cdaeff [CELEBORN-1974] ApplicationId as metrics label should be behind a config flag
### What changes were proposed in this pull request?

Push applicationId as metrics label only if `celeborn.metrics.worker.appLevel.enabled` is true.

### Why are the changes needed?

At Uber, We use m3 for monitoring, it tries to make a new series using all the present metrics label. Having applicationId as a metrics introduces too much cardinality in `activeconnectioncount` and we are unable to use it, while it is an useful metric with/without applicationId as label. Similarly for resourceConsumption, userIdentifier alone can be used.

### Does this PR introduce _any_ user-facing change?
Yes, changed the default config value.

### How was this patch tested?
NA

Closes #3221 from s0nskar/application_tag.

Lead-authored-by: Sanskar Modi <sanskarmodi97@gmail.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-12 21:05:45 -07:00
Nicolas Fraison
c9ca90c5ee [CELEBORN-1965] Rely on all default hadoop providers for S3 auth
### What changes were proposed in this pull request?

Support all [default hadoop provider](https://github.com/apache/hadoop/blob/rel/release-3.3.6/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AUtils.java#L563) for S3 authentication

### Why are the changes needed?

As of now celeborn only support authentication based on ACESS/SECRET key while other authentication mechanism can be required (for ex. ENV var, relying on [AWS_CONTAINER_CREDENTIALS_RELATIVE_URI](https://docs.aws.amazon.com/sdkref/latest/guide/feature-container-credentials.html))

### Does this PR introduce _any_ user-facing change?

yes, the `celeborn.storage.s3.secret.key` and `celeborn.storage.s3.access.key` are removed. In order to still provide those we should rely on the hadoop config (`celeborn.hadoop.fs.s3a.access.key` / `celeborn.hadoop.fs.s3a.secret.key `)

### How was this patch tested?

Tested on celeborn cluster deployed on kubernetes and configured to use S3 relying on `IAMInstanceCredentialsProvider`

Closes #3243 from ashangit/nfraison/CELEBORN-1965.

Lead-authored-by: Nicolas Fraison <nfraison@yahoo.fr>
Co-authored-by: nicolas.fraison@datadoghq.com <nicolas.fraison@datadoghq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-09 14:16:47 +08:00
SteNicholas
74b41bb39d [CELEBORN-1319][CELEBORN-474][FOLLOWUP] PushState uses JavaUtils#newConcurrentHashMap to speed up ConcurrentHashMap#computeIfAbsent
### What changes were proposed in this pull request?

`PushState ` uses `JavaUtils#newConcurrentHashMap` to speed up `ConcurrentHashMap#computeIfAbsent`.

### Why are the changes needed?

Celeborn supports JDK8, which could meet the bug mentioned in [JDK-8161372](https://bugs.openjdk.org/browse/JDK-8161372). Therefore, it's better to use `JavaUtils#newConcurrentHashMap` to speed up `ConcurrentHashMap#computeIfAbsent`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #3247 from SteNicholas/CELEBORN-1319.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-08 19:45:11 +08:00
Nicolas Fraison
54732c7b38 Update celeborn conf to add S3 in default and doc for policy
### What changes were proposed in this pull request?

Add S3 type in evict and create policies
Add S3 type in list of default evict and create policy

### Why are the changes needed?

To align with other types

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Closes #3218 from ashangit/nfraison/doc_s3.

Authored-by: Nicolas Fraison <nfraison@yahoo.fr>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-05-08 16:52:44 +08:00