Commit Graph

1323 Commits

Author SHA1 Message Date
Jiaan Geng
cdcd30dc2f [CELEBORN-1041] Improve the implementation for get the PartitionIdPassthrough class
### What changes were proposed in this pull request?
Currently, the code of get the contractor of `PartitionIdPassthrough` is very redundant.
We should improve the implementation.

### Why are the changes needed?
Improve the implementation for get the `PartitionIdPassthrough` class

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New test cases.

Closes #1989 from beliefer/CELEBORN-1041.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-16 14:24:49 +08:00
Jiaan Geng
640007100a [CELEBORN-1019][FOLLOWUP] Remove duplicate line
### What changes were proposed in this pull request?
https://github.com/apache/incubator-celeborn/pull/1968 introduced duplicate lines of code, this is my fault.

### Why are the changes needed?
Remove duplicate line.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #1990 from beliefer/CELEBORN-1019_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-16 13:48:36 +08:00
SteNicholas
f61fe17551 [CELEBORN-987][FOLLOWUP][DOC] README#Build and sbt#System Requirements should extend to Scala 2.13 and Spark 3.5
### What changes were proposed in this pull request?

`README#Build` and `sbt#System Requirements` extends to Scala 2.13.

### Why are the changes needed?

`README#Build` and `sbt#System Requirements`should extend to Scala 2.13 to align the SBT CI test results.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

SBT CI tests.

Closes #1987 from SteNicholas/CELEBORN-987.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-14 09:54:22 +08:00
onebox-li
6b3c108f6e [CELEBORN-1040] Adjust local read logs and refine createReader
### What changes were proposed in this pull request?
Adjust the local reader logs. Before, it will log local read stats in each stream clos whether it really contains local read.
And refine the CelebornInputStreamImpl#createReader code to be more clearer.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
Adjust local read logs.

### How was this patch tested?
Manual test.

Closes #1988 from onebox-li/local-dev.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 20:59:38 +08:00
onebox-li
9a90ac8b6d [CELEBORN-1036] Map task hangs at limitZeroInFlight due to duplicate onFailure called
### What changes were proposed in this pull request?
In our test jobs, we found few map tasks may hang at InFlightRequestTracker#limitZeroInFlight (both
 prepareForMergeData and mapEndInternal can occurs) when worker unexpected shutdown. We add logs to trace InFlightRequestTracker#totalInflightReqs and found this adder may become negative In the above case.

When worker suddenly shutdown, the channel connection raise exception.
If NioEventLoop.processSelectedKeys is doing read, the exceptionCaught will be called. In TransportResponseHandler#exceptionCaught will call failOutstandingRequests and each request‘s onFailure callback.
```
WARN [data-client-5-9] TransportChannelHandler: Exception in connection from /xx
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.celeborn.shaded.io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:256)
	at org.apache.celeborn.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
	at org.apache.celeborn.shaded.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:357)
	at org.apache.celeborn.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at org.apache.celeborn.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)
ERROR [data-client-5-9] ShuffleClientImpl: Push data to xx failed for shuffle 0 map 11 attempt 0 partition 178 batch 634, remain revive times 5.
```
Next NioEventLoop start to `runAllTasks` in the finally block.If there is push write task, PushChannelListener.handleFailure will be called because of the closing channel. Here callback.onFailure may have a data race on `outstandingPushes`.
```
ERROR [data-client-5-9] ShuffleClientImpl: Push data to xx failed for shuffle 0 map 11 attempt 0 partition 178 batch 634, remain revive times 4.
org.apache.celeborn.common.exception.CelebornIOException: Failed to send request PUSH 1264 to /xx: org.apache.celeborn.shaded.io.netty.channel.StacklessClosedChannelException, channel will be closed
	at org.apache.celeborn.common.network.client.TransportClient$PushChannelListener.handleFailure(TransportClient.java:382)
	at org.apache.celeborn.common.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:325)
	at org.apache.celeborn.common.network.client.TransportClient$PushChannelListener.operationComplete(TransportClient.java:373)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:557)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:999)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:860)
	at org.apache.celeborn.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:877)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:863)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:968)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:856)
	at org.apache.celeborn.shaded.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:113)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:881)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:863)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:968)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:856)
	at org.apache.celeborn.shaded.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:879)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:940)
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1247)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at org.apache.celeborn.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at org.apache.celeborn.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.apache.celeborn.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.celeborn.shaded.io.netty.channel.StacklessClosedChannelException
	at org.apache.celeborn.shaded.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
```

Duplicate callback.onFailure will lead to totalInflightReqs count exception.

Here race will not be too severe and only occur under exception situation. So I think synchronize a lock is enough to avoid race.

### Why are the changes needed?
Increase robustness.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test.

Closes #1978 from onebox-li/fix-handle-channel-failure.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 20:13:32 +08:00
sychen
dd65e74f99 [CELEBORN-983] Rename PrometheusMetric configuration
### What changes were proposed in this pull request?
Replace
```properties
celeborn.metrics.master.prometheus.host
celeborn.metrics.master.prometheus.port
celeborn.metrics.worker.prometheus.host
celeborn.metrics.worker.prometheus.port
```
With
```properties
celeborn.master.http.host
celeborn.master.http.port
celeborn.worker.http.host
celeborn.worker.http.port
```

### Why are the changes needed?
The `celeborn.master.metrics.prometheus.port` and `celeborn.metrics.worker.prometheus.port` bind port not only serve prometheus metrics, but also provide some useful API services.

https://celeborn.apache.org/docs/latest/monitoring/#rest-api

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1919 from cxzl25/CELEBORN-983.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 13:28:58 +08:00
onebox-li
2b79692585 [CELEBORN-688] Add JVM metrics grafana template
### What changes were proposed in this pull request?
Currently there is no JVM metrics grafana template, nor in grafana labs. For better use, it is necessary to add one.
According the change in #1939
This template uses two variables(instance, pool).
The layout is divided into 5 rows.
![image](https://github.com/apache/incubator-celeborn/assets/19429353/732cff90-463c-47b5-89b8-fa8dbbf33b1e)

The panels with g1 look like below:
![image](https://github.com/apache/incubator-celeborn/assets/19429353/919b7e9e-f86a-4341-a004-7f0394e1d8b2)

JVM Memory Pools row uses replicated panel mode which panels are automatically deplicated by `pool` variables.
![image](https://github.com/apache/incubator-celeborn/assets/19429353/3bdf7a3c-d4e0-42ea-bbe0-012da55a61d1)
![image](https://github.com/apache/incubator-celeborn/assets/19429353/8feaf9b7-156d-453e-8188-40a0399ea516)
![image](https://github.com/apache/incubator-celeborn/assets/19429353/cba4b61c-7d66-4893-9f07-6157c64869bd)
![image](https://github.com/apache/incubator-celeborn/assets/19429353/09b473ef-434c-4fd0-aa4b-084f7588a4f7)

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
Yes, this dashboard is based on changes in #1939

### How was this patch tested?
Cluster test

Closes #1964 from onebox-li/add-jvm-dashboard.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-13 11:54:49 +08:00
onebox-li
8b1bd07905
[CELEBORN-1037] Incorrect output for metrics of Prometheus
### What changes were proposed in this pull request?
The new added `deadlocks` metrics in `ThreadStatesGaugeSet` is a Set<String>, which is invalid. So here add a filter at the `addGauge` extrance.

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
Remove the unused metrics. BTW the template use `metrics_jvm_thread_deadlock_count_Value`

### How was this patch tested?
Manual test

Closes #1981 from onebox-li/fix-1037.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-10-13 11:18:03 +08:00
SteNicholas
c97628c510 [CELEBORN-987][DOC] README#Build should extend to Java8/11/17
### What changes were proposed in this pull request?

`README#Build` extends to Java8/11/17. Meanwhile, the profile of maven adds `jdk-17`.

### Why are the changes needed?

`README#Build` should extend to Java8/11/17. Meanwhile, the profile of maven should add jdk-17.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Local maven compile.

Closes #1985 from SteNicholas/CELEBORN-987.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-10-12 21:58:32 +08:00
sychen
61fadd57bd [CELEBORN-665] Skip empty app snapshot logs
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1973 from cxzl25/CELEBORN-665.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-12 21:12:15 +08:00
xleoken
304b796475 [MINOR] Fix wrong description about app list
### What changes were proposed in this pull request?

Fix wrong description about app list.

### How was this patch tested?

local.

Closes #1979 from xleoken/patch2.

Authored-by: xleoken <leo65535@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-12 20:43:03 +08:00
onebox-li
a47f6169d8 [MINOR] Fix some typos
### What changes were proposed in this pull request?
Fix some typos

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
-

Closes #1983 from onebox-li/fix-typo.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-12 20:34:07 +08:00
zml1206
74a5acfd7b [CELEBORN-1038] Clean up deprecated api
### What changes were proposed in this pull request?
Replace `org.apache.commons.io.Charsets.UTF_8` to `java.nio.charset.StandardCharsets.UTF_8`.
Replace `Assert.assertEquals` to `Assert.assertArrayEquals`.

### Why are the changes needed?

Clean up deprecated api.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Existing UT.

Closes #1980 from zml1206/CELEBORN-1038.

Authored-by: zml1206 <zhuml1206@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-12 17:14:08 +08:00
SteNicholas
438cdf6747 [CELEBORN-973] Improve HttpRequestHandler handle HTTP request with base, master and worker
### What changes were proposed in this pull request?

The code that `HttpRequestHandler` handles HTTP request could be improved with handling HTTP request with base, master and worker.

### Why are the changes needed?

Improves `HttpRequestHandler` handle HTTP request with base, master and worker.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Internal tests.

Closes #1977 from SteNicholas/http-request-handler.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-10-12 15:08:15 +08:00
sychen
62ba44d8da [CELEBORN-1026] Optimize registerShuffle fallback log
### What changes were proposed in this pull request?

### Why are the changes needed?
According to https://github.com/apache/incubator-celeborn/pull/1955 , when celeborn is not available, `spark.dynamicAllocation.enabled=true` and  `spark.shuffle.service. enabled=true`, shuffle data should not be lost.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1963 from cxzl25/CELEBORN-1026.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-10-11 20:25:44 +08:00
SteNicholas
84a318f716 [CELEBORN-1033] MasterNotLeaderException should provide the cause of exception
### What changes were proposed in this pull request?

`HAHelper#sendFailure` only sends `MasterNotLeaderException` without cause, which causes that the actual exception of `MasterNotLeaderException` could not catch for troubleshooting.

### Why are the changes needed?

`MasterNotLeaderException` provides the cause of exception.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`MasterClientSuiteJ`

Closes #1972 from SteNicholas/CELEBORN-1033.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-10-11 20:18:58 +08:00
sychen
6fa669748c [CELEBORN-999] MR deps check
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
./dev/dependencies.sh  --module mr --check
./dev/dependencies.sh  --module mr --check --sbt
```

Closes #1928 from cxzl25/CELEBORN-999.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-11 13:56:31 +08:00
sychen
9c07ceddb0 [CELEBORN-1028][FOLLOWUP][DOCS] Make prometheus path configurable
### What changes were proposed in this pull request?

### Why are the changes needed?
https://github.com/apache/incubator-celeborn/pull/1965#issuecomment-1755345813

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

<img width="1410" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/6454133a-040b-4dde-84b7-dbf08fb15b13">

<img width="1401" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/3cdfa9f2-9a7a-43cb-9006-77810a350669">

Closes #1974 from cxzl25/CELEBORN-1028-FOLLOWUP.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 22:59:22 +08:00
sychen
bcf89da7dd [MINOR] Fix typo in CelebornConf
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1971 from cxzl25/typo.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 20:04:16 +08:00
Jiaan Geng
a776e7e315 [CELEBORN-1019] Avoid magic strings copy from Spark SQLConf
### What changes were proposed in this pull request?
Avoid magic strings copy from Spark `SQLConf`.

### Why are the changes needed?
Currently, the spark integration uses many magic strings copy from Spark `SQLConf`.
Since we have already depend on Spark, references the variable from Spark is better.

### Does this PR introduce _any_ user-facing change?
No.
Just update the inner implementation.

### How was this patch tested?
Exists test cases.

Closes #1968 from beliefer/CELEBORN-1019.

Lead-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 19:52:24 +08:00
sychen
f6d27609b8 [CELEBORN-1028] Make prometheus path configurable
### What changes were proposed in this pull request?
`celeborn.metrics.prometheus.path`

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1965 from cxzl25/CELEBORN-1028.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 18:37:44 +08:00
zwangsheng
50843c23e8 [CELEBORN-1032][WORKER] Use scheduleWithFixedDelay instead of scheduleAtFixedRate in worker timeout check threads pool
### What changes were proposed in this pull request?
Use `scheduleWithFixedDelay` instead of `scheduleAtFixedRate` when worker submit push/fetch timeout check task.

### Why are the changes needed?

`scheduleAtFixedRate` means
> Creates and executes a periodic action that becomes enabled first after the given initial delay, and subsequently with the given period; that is executions will commence after initialDelay then initialDelay+period, then initialDelay + 2 * period, and so on.

`scheduleWithFixedDelay` means
> Creates and executes a periodic action that becomes enabled first after the given initial delay, and subsequently with the given delay between the termination of one execution and the commencement of the next.

So in `scheduleAtFixedRate` case, we experience task stacking in DelayedWorkQueue, due to the long execution time of a single task.

Too much task remain in `DelayedWorkQueue` will cost lot memory and CPU to put or pop.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Before:
```patch
[arthas6]$ getstatic org.apache.celeborn.common.network.client.TransportResponseHandler pushTimeoutChecker
field: pushTimeoutChecker
ScheduledThreadPoolExecutor[
    continueExistingPeriodicTasksAfterShutdown=Boolean[false],
    executeExistingDelayedTasksAfterShutdown=Boolean[true],
    removeOnCancel=Boolean[true],
    sequencer=AtomicLong[912892],
    ctl=AtomicInteger[-536870908],
    COUNT_BITS=Integer[29],
    CAPACITY=Integer[536870911],
    RUNNING=Integer[-536870912],
    SHUTDOWN=Integer[0],
    STOP=Integer[536870912],
    TIDYING=Integer[1073741824],
    TERMINATED=Integer[1610612736],
+   workQueue=DelayedWorkQueue[isEmpty=false;size=157377],
    mainLock=ReentrantLock[java.util.concurrent.locks.ReentrantLock3e51a6ff[Unlocked]],
    workers=HashSet[isEmpty=false;size=4],
    termination=ConditionObject[java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject4944ebf],
    largestPoolSize=Integer[4],
    completedTaskCount=Long[0],
    threadFactory=[com.google.common.util.concurrent.ThreadFactoryBuilder$172260dff],
    handler=AbortPolicy[java.util.concurrent.ThreadPoolExecutor$AbortPolicy50f54af5],
    keepAliveTime=Long[0],
    allowCoreThreadTimeOut=Boolean[false],
    corePoolSize=Integer[4],
    maximumPoolSize=Integer[2147483647],
    defaultHandler=AbortPolicy[java.util.concurrent.ThreadPoolExecutor$AbortPolicy50f54af5],
    shutdownPerm=RuntimePermission[("java.lang.RuntimePermission" "modifyThread")],
    acc=null,
    ONLY_ONE=Boolean[true],
    $assertionsDisabled=Boolean[true],
]
```
```log
// from arthas threads command
ID      NAME                                            GROUP                   PRIORITY         STATE           %CPU            DELTA_TIME      TIME            INTERRUPTED     DAEMON
61       push-timeout-checker-1                               main                      5                 WAITING           9.24             0.018             820:55.660        false            true
57       push-timeout-checker-0                               main                      5                 WAITING           9.09             0.018             818:6.882         false            true
67       push-timeout-checker-3                               main                      5                 WAITING           8.44             0.017             816:47.813        false            true
68       fetch-timeout-checker-3                              main                      5                 WAITING           8.42             0.017             825:41.115        false            true
66       fetch-timeout-checker-2                              main                      5                 WAITING           8.18             0.016             824:17.086        false            true
58       fetch-timeout-checker-0                              main                      5                 WAITING           8.16             0.016             799:45.428        false            true
65       push-timeout-checker-2                               main                      5                 WAITING           8.15             0.016             835:18.827        false            true
62       fetch-timeout-checker-1                              main                      5                 WAITING           8.14             0.016             804:47.836        false            true
```

![popo_2023-10-10  16-12-35](https://github.com/apache/incubator-celeborn/assets/52876270/0e7401e1-ca0d-46bd-900f-af5e27382e2a)

After:
```patch
[arthas6]$ getstatic org.apache.celeborn.common.network.client.TransportResponseHandler pushTimeoutChecker
field: pushTimeoutChecker
ScheduledThreadPoolExecutor[
    continueExistingPeriodicTasksAfterShutdown=Boolean[false],
    executeExistingDelayedTasksAfterShutdown=Boolean[true],
    removeOnCancel=Boolean[true],
    sequencer=AtomicLong[151380],
    ctl=AtomicInteger[-536870908],
    COUNT_BITS=Integer[29],
    CAPACITY=Integer[536870911],
    RUNNING=Integer[-536870912],
    SHUTDOWN=Integer[0],
    STOP=Integer[536870912],
    TIDYING=Integer[1073741824],
    TERMINATED=Integer[1610612736],
+   workQueue=DelayedWorkQueue[isEmpty=false;size=447],
    mainLock=ReentrantLock[java.util.concurrent.locks.ReentrantLock7319376f[Unlocked]],
    workers=HashSet[isEmpty=false;size=4],
    termination=ConditionObject[java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject16b22de2],
    largestPoolSize=Integer[4],
    completedTaskCount=Long[0],
    threadFactory=[com.google.common.util.concurrent.ThreadFactoryBuilder$11108fdba],
    handler=AbortPolicy[java.util.concurrent.ThreadPoolExecutor$AbortPolicy358470bb],
    keepAliveTime=Long[0],
    allowCoreThreadTimeOut=Boolean[false],
    corePoolSize=Integer[4],
    maximumPoolSize=Integer[2147483647],
    defaultHandler=AbortPolicy[java.util.concurrent.ThreadPoolExecutor$AbortPolicy358470bb],
    shutdownPerm=RuntimePermission[("java.lang.RuntimePermission" "modifyThread")],
    acc=null,
    ONLY_ONE=Boolean[true],
    $assertionsDisabled=Boolean[true],
]
```

![popo_2023-10-10  16-01-12](https://github.com/apache/incubator-celeborn/assets/52876270/c6395769-b7d2-4050-9046-8aafcaa5fa6a)
![popo_2023-10-10  16-02-43](https://github.com/apache/incubator-celeborn/assets/52876270/658a8e41-afbc-498f-99d3-64aa3efcd3d9)
![popo_2023-10-10  16-03-04](https://github.com/apache/incubator-celeborn/assets/52876270/43dbe4a3-e42b-4553-b160-eff567e5d4cb)

Closes #1970 from zwangsheng/CELEBORN-1032.

Authored-by: zwangsheng <binjieyang@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-10 17:32:20 +08:00
SteNicholas
56276e910f [CELEBORN-1024] Thread factory should set UncaughtExceptionHandler to handle uncaught exception
### What changes were proposed in this pull request?

`batchHandleChangePartitionExecutors` could not handle uncaught exception in `ChangePartitionRequest`, which causes that the uncaught exception of the thread could not get for troubleshooting. Thread factory should set `UncaughtExceptionHandler` to handle uncaught exception.

### Why are the changes needed?

Thread factory sets `UncaughtExceptionHandler` to handle uncaught exception in `ThreadUtils`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #1962 from SteNicholas/CELEBORN-1024.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-10-09 20:56:40 +08:00
Fu Chen
b2412d0774 [CELEBORN-1022][TEST] Update log level from FATAL to ERROR for console output in unit tests
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

1. this is developer-friendly for debugging unit tests in IntelliJ IDEA, for example: Netty's memory leak reports are logged at the error level and won't cause unit tests to be marked as fatal.

```
23/10/09 09:57:26,422 ERROR [fetch-server-52-2] ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
	io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:403)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
	io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:140)
	io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:120)
	io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:150)
	io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	java.lang.Thread.run(Thread.java:750)
```

2. this won't increase console output and affect the stability of CI.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1958 from cfmcgrady/ut-console-log-level.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-09 15:56:05 +08:00
Fu Chen
c4135dc1b1 [CELEBORN-980] Asynchronously delete original files to fix ReusedExchange bug
### What changes were proposed in this pull request?

The `ReusedExchange` operator has the potential to generate different types of fetch requests, including both non-range and range requests. Currently, an issue arises due to the synchronous deletion of the original file by the Celeborn worker upon completion of sorting. This issue leads to the failure of non-range requests following a range request for the same partition.

the snippets to reproduce this bug
```scala
  val sparkConf = new SparkConf().setAppName("celeborn-test").setMaster("local[2]")
    .set("spark.shuffle.manager", "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
    .set(s"spark.${CelebornConf.MASTER_ENDPOINTS.key}", masterInfo._1.rpcEnv.address.toString)
    .set("spark.sql.autoBroadcastJoinThreshold", "-1")
    .set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "100")
    .set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "100")
  val spark = SparkSession.builder()
    .config(sparkConf)
    .getOrCreate()
  spark.range(0, 1000, 1, 10)
    .selectExpr("id as k1", "id as v1")
    .createOrReplaceTempView("ta")
  spark.range(0, 1000, 1, 10)
    .selectExpr("id % 1 as k21", "id % 1 as k22", "id as v2")
    .createOrReplaceTempView("tb")
  spark.range(140)
    .select(
      col("id").cast("long").as("k3"),
      concat(col("id").cast("string"), lit("a")).as("v3"))
    .createOrReplaceTempView("tc")

  spark.sql(
    """
      |SELECT *
      |FROM ta
      |LEFT JOIN tb ON ta.k1 = tb.k21
      |LEFT JOIN tc ON tb.k22 = tc.k3
      |""".stripMargin)
    .createOrReplaceTempView("v1")

  spark.sql(
    """
      |SELECT * FROM v1 WHERE v3 IS NOT NULL
      |UNION
      |SELECT * FROM v1
      |""".stripMargin)
    .collect()
```

This PR proposes a solution to address this problem. It introduces an asynchronous thread for the removal of the original file. Once the sorted file is generated for a given partition, this modification ensures that both non-range and range fetch requests will be able to and only fetch the sorted file once it is generated for a given partition.

this activity diagram of `openStream`

![openStream](https://github.com/apache/incubator-celeborn/assets/8537877/633cc5b8-e673-45a0-860e-e1f7e50c8965)

### Does this PR introduce _any_ user-facing change?

No, only bug fix

### How was this patch tested?

UT

Closes #1932 from cfmcgrady/fix-partition-sort-bug-v4.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-09 11:04:41 +08:00
Bowen Song
a734b8cb79 [CELEBORN-1020] Remove outdated info in README.md file
### What changes were proposed in this pull request?
The description about restart a Celeborn cluster is outdated, remove this part in README file

Closes #1957 from zgzzbws/edit-doc.

Authored-by: Bowen Song <song_bowen_work@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-09 00:11:47 +08:00
xleoken
aba139a373
[CELEBORN-1018] Fix throw exception when exec create-package.sh script
### What changes were proposed in this pull request?

when exec the release script, throw exception
```
build/make-distribution.sh --release
```

```
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[ERROR] [ERROR] Could not find the selected project in the reactor: :celeborn-client-mr-shaded_2.12
[ERROR] Could not find the selected project in the reactor: :celeborn-client-mr-shaded_2.12 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MavenExecutionException
```

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

Local tested.

### How was this patch tested?

Closes #1956 from xleoken/patch.

Authored-by: xleoken <leo65535@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-10-08 11:32:49 +08:00
sychen
22f523537e [CELEBORN-1002] Add SBT MRClientProject
### What changes were proposed in this pull request?

### Why are the changes needed?

```bash
./build/make-distribution.sh --sbt-enabled -Pmr
```

```bash
./build/make-distribution.sh --sbt-enabled --release
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1930 from cxzl25/CELEBORN-1002.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-10-08 10:03:21 +08:00
mingji
95c9ccfc3e [CELEBORN-1010] Update docs about spark.shuffle.service.enabled
### What changes were proposed in this pull request?
To clarify a spark config to work with Celeborn.

### Why are the changes needed?
After some tests, I found that Spark 3.1 and newer can work with Celeborn with `spark.shuffle.service.enabled=true`.

ExternalShuffleBlockResolver won't check the shuffle manager's type since Spark 3.1 and newer.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
I tested two scenarios about this PR.
1. Check whether Spark can release the executors in time.
2. Check data correctness by running TPC-DS.
All checks are good.

Closes #1955 from FMX/CELEBORN-1010.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-08 09:15:42 +08:00
jiaoqingbo
785a035eca [CELEBORN-1017] Make checkPushTimeout and checkFetchTimeout conditions independent
### What changes were proposed in this pull request?

Make checkPushTimeout and checkFetchTimeout conditions independent

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1940 introduced a bug

![image](https://github.com/apache/incubator-celeborn/assets/14961757/8e8764d4-235d-49cf-8a87-1052ffbe4697)

The judgment logic of checkFetchTimeout is added to the judgment logic of checkPushTimeout. The two of them should be independent.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

PASS GA

Closes #1954 from jiaoqingbo/1017.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-07 16:23:44 +08:00
xleoken
ccea1bda39 [MINOR] Fix integer overflow in expression
### What changes were proposed in this pull request?

Fix integer overflow in expression, the `64 * 1024 * 1024 * 1024` result is `0` not a long value.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1951 from xleoken/patch.

Authored-by: xleoken <leo65535@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-30 21:43:47 +08:00
Cheng Pan
84ef527181
[CELEBORN-1007][FOLLOWUP][DOCS] Update Migration Guide
### What changes were proposed in this pull request?

Mention metrics name change in Migration Guide

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1939

### Does this PR introduce _any_ user-facing change?

Yes, docs updated.

### How was this patch tested?

Review.

Closes #1950 from pan3793/CELEBORN-1007-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 21:08:11 +08:00
sychen
5310bcaf6b
[CELEBORN-313] Add rest endpoint to show master group info
### What changes were proposed in this pull request?

<img width="1347" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/43d10bff-6878-4591-9461-889494d797f9">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

```bash
./bin/celeborn-ratis sh -Draft.rpc.type=NETTY  group info   -peers clb-1:9872,clb-2:9873,clb-3:9874
```

```
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 1(clb-1:9872)

[server {
  id: "1"
  address: "clb-1:9872"
  clientAddress: "clb-1:9097"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "2"
  address: "clb-2:9873"
  clientAddress: "clb-2:9098"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "3"
  address: "clb-3:9874"
  clientAddress: "clb-3:9099"
  startupRole: FOLLOWER
}
commitIndex: 316
]
```

```bash
curl http://clb-3:9983/masterGroupInfo
```

```
====================== Master Group INFO ==============================
group id: c5196f6d-2c34-3ed3-8b8a-47bede733167
leader info: 1(clb-1:9872)

[server {
  id: "3"
  address: "clb-3:9874"
  clientAddress: "clb-3:9099"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "1"
  address: "clb-1:9872"
  clientAddress: "clb-1:9097"
  startupRole: FOLLOWER
}
commitIndex: 316
, server {
  id: "2"
  address: "clb-2:9873"
  clientAddress: "clb-2:9098"
  startupRole: FOLLOWER
}
commitIndex: 316
]
```

Closes #1946 from cxzl25/CELEBORN-313.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 20:08:31 +08:00
Cheng Pan
e4a60d15e4
[CELEBORN-909][FOLLOWUP][DOCS] Restore titles in migration guide
### What changes were proposed in this pull request?

Restore titles in migration guide

### Why are the changes needed?

Make title in migration guide consistent.

### Does this PR introduce _any_ user-facing change?

Yes, docs changed.

### How was this patch tested?

Pass GA.

Closes #1949 from pan3793/CELEBORN-909-followup.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 20:04:53 +08:00
Cheng Pan
ab68a4ae1b
[MINOR] Fix configuration version
### What changes were proposed in this pull request?

Change the `.version("0.3.2")` to `.version("0.3.1")`

### Why are the changes needed?

0.3.1 is not release yet.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

Closes #1948 from pan3793/minor-version.

Lead-authored-by: Cheng Pan <chengpan@apache.org>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 19:58:06 +08:00
Cheng Pan
78553f1418
[CELEBORN-1003] Correct the LICENSE and NOTICE for shaded client jars
### What changes were proposed in this pull request?

Correct the `LICENSE` and `NOTICE` for the following shaded client jars

- `celeborn-client-flink-1.14-shaded_2.12-<version>.jar`
- `celeborn-client-flink-1.15-shaded_2.12-<version>.jar`
- `celeborn-client-flink-1.17-shaded_2.12-<version>.jar`
- `celeborn-client-mr-shaded_2.12-<version>.jar`
- `celeborn-client-spark-2-shaded_2.11-<version>.jar`
- `celeborn-client-spark-3-shaded_2.12-<version>.jar`

### Why are the changes needed?

The `LICENSE` and `NOTICE` shipped in a jar should match the content of the jar, for shaded jars, it should acknowledge all the third-party classes that are bundled.

See more discussion at https://lists.apache.org/thread/8v4wy5o132rpsjync6465zztgjlf6h5p

For how to determine which third-party jars are bundled, take `celeborn-client-spark-3-shaded_2.12-<version>.jar` as an example, the following command performs the packaging, and we can find them out by looking at logs like `Including ... in the shaded jar`

```
build/mvn clean package -DskipTests -pl :celeborn-client-spark-3-shaded_2.12 -am -Pspark-3.3
```

```
[INFO] --- maven-shade-plugin:3.4.0:shade (default)  celeborn-client-spark-3-shaded_2.12 ---
[INFO] Including org.apache.celeborn:celeborn-client-spark-3_2.12🫙0.4.0-SNAPSHOT in the shaded jar.
[INFO] Including org.apache.celeborn:celeborn-common_2.12🫙0.4.0-SNAPSHOT in the shaded jar.
[INFO] Including org.apache.commons:commons-lang3:jar:3.12.0 in the shaded jar.
[INFO] Including io.netty:netty-all:jar:4.1.93.Final in the shaded jar.
[INFO] Including io.netty:netty-buffer:jar:4.1.93.Final in the shaded jar.
...
[INFO] Excluding org.apache.ratis:ratis-common:jar:2.5.1 from the shaded jar.
[INFO] Excluding org.apache.ratis:ratis-thirdparty-misc:jar:1.0.4 from the shaded jar.
[INFO] Excluding org.apache.ratis:ratis-proto:jar:2.5.1 from the shaded jar.
...
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.

Closes #1933 from pan3793/CELEBORN-1003.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 19:23:54 +08:00
sychen
7e944c1a50
[CELEBORN-1014] Output log with bound address and port
### What changes were proposed in this pull request?

### Why are the changes needed?
Make it easy for administrators to find the address of the http service bindings.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

```
23/09/28 17:10:50,465 INFO [main] HttpServer: master: HttpServer started on port 9983.
```

PR
```
23/09/28 17:28:29,797 INFO [main] HttpServer: master: HttpServer started on clb-3 with port 9983.
```

Closes #1947 from cxzl25/CELEBORN-1014.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 19:07:12 +08:00
sychen
16bf2aeeaa
[CELEBORN-1013] Shutdown master if initialized failed
### What changes were proposed in this pull request?
```java
23/09/28 14:48:12,512 ERROR [main] Master: Initialize master failed.
java.net.BindException: Address already in use
	at sun.nio.ch.Net.bind0(Native Method)
	at sun.nio.ch.Net.bind(Net.java:461)
	at sun.nio.ch.Net.bind(Net.java:453)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:141)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:562)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
```

### Why are the changes needed?
For example, bind's http service port(`celeborn.metrics.master.prometheus.port`) port is occupied and master startup fails, but because the thread started by Raft is not a daemon, the master process still exists.

d461a01a53/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java (L283-L290)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1945 from cxzl25/CELEBORN-1013.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-28 19:02:59 +08:00
sychen
a9ed7f6a39 [CELEBORN-986] Use formatted log instead of string concat
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
GA

Closes #1941 from cxzl25/CELEBORN-986.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-28 16:29:58 +08:00
onebox-li
ef4fc51d5f [CELEBORN-1008] Adjust push/fetch timeout checker thread pool and tasks
### What changes were proposed in this pull request?
Only push/data module needs push-timeout-checker, and data module needs fetch-timeout-checker.
Here make push-timeout-checker not to be created in Master/LifeCycleManager, and fetch-timeout-checker in Worker.
The same goes for related timeout checker schedule tasks.

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster test

Closes #1940 from onebox-li/checker-dev.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-28 10:43:08 +08:00
onebox-li
b4dfc09dcf [CELEBORN-1007] Improve JVM metrics naming and add ThreadStates metrics
### What changes were proposed in this pull request?
Since we use codahale metrics to expose JVM metrics, the name without prefix is not clear and it‘s not easy to make a grafana template for these metrics because it adds collector name or pool name in names rather than labels.

So here I add jvm metric prefixes, remove pool info from name and obtain the pool name as labels if needed.
And add ThreadStates metrics additionally.

### Why are the changes needed?
Make jvm metrics easy to understand and get template

### Does this PR introduce _any_ user-facing change?
Yes,jvm metrics naming is changed,expose threads state additionally.

change examples like below:
For GarbageCollectorMetricSet, G1-Old-Generation.time -> jvm.gc.time{name="G1-Old-Generation"}
For MemoryUsageGaugeSet, total.init -> jvm.memory.total.init ; pools.Metaspace.usage -> jvm.memory.pools.usage{name="Metaspace"}
For BufferPoolMetricSet, direct.count -> jvm.direct.count
For ThreadStatesGaugeSet, add jvm.thread.count.

For G1, the jvm metrics exposed now:
metrics_jvm_gc_time_Value{name="G1-Old-Generation",role="Worker"} 0 1695731141588
metrics_jvm_gc_count_Value{name="G1-Young-Generation",role="Worker"} 2 1695731141588
metrics_jvm_gc_time_Value{name="G1-Young-Generation",role="Worker"} 74 1695731141588
metrics_jvm_gc_count_Value{name="G1-Old-Generation",role="Worker"} 0 1695731141588

metrics_jvm_heap_committed_Value{role="Worker"} 2109734912 1695731141588
metrics_jvm_non_heap_used_Value{role="Worker"} 47700056 1695731141588
metrics_jvm_heap_used_Value{role="Worker"} 82801184 1695731141588
metrics_jvm_total_committed_Value{role="Worker"} 2160263168 1695731141588
metrics_jvm_total_init_Value{role="Worker"} 2112290816 1695731141588
metrics_jvm_non_heap_max_Value{role="Worker"} -1 1695731141588
metrics_jvm_heap_usage_Value{role="Worker"} 0.009639326483011246 1695731141588
metrics_jvm_total_used_Value{role="Worker"} 130502480 1695731141589
metrics_jvm_heap_init_Value{role="Worker"} 2109734912 1695731141589
metrics_jvm_non_heap_committed_Value{role="Worker"} 50528256 1695731141589
metrics_jvm_non_heap_init_Value{role="Worker"} 2555904 1695731141589
metrics_jvm_non_heap_usage_Value{role="Worker"} -4.7701296E7 1695731141589
metrics_jvm_heap_max_Value{role="Worker"} 8589934592 1695731141589
metrics_jvm_total_max_Value{role="Worker"} 8589934591 1695731141589
metrics_jvm_memory_pool_used_Value{name="Code-Cache",role="Worker"} 10314368 1695731141588
metrics_jvm_memory_pool_committed_Value{name="Code-Cache",role="Worker"} 10944512 1695731141588
metrics_jvm_memory_pool_init_Value{name="G1-Eden-Space",role="Worker"} 111149056 1695731141588
metrics_jvm_memory_pool_max_Value{name="G1-Old-Gen",role="Worker"} 8589934592 1695731141588
metrics_jvm_memory_pool_used_after_gc_Value{name="G1-Survivor-Space",role="Worker"} 14680064 1695731141588
metrics_jvm_memory_pool_used_Value{name="Compressed-Class-Space",role="Worker"} 4440192 1695731141588
metrics_jvm_memory_pool_usage_Value{name="Metaspace",role="Worker"} 0.9449504192610433 1695731141588
metrics_jvm_memory_pool_max_Value{name="Metaspace",role="Worker"} -1 1695731141588
metrics_jvm_memory_pool_init_Value{name="G1-Survivor-Space",role="Worker"} 0 1695731141588
metrics_jvm_memory_pool_committed_Value{name="G1-Old-Gen",role="Worker"} 1998585856 1695731141588
metrics_jvm_memory_pool_committed_Value{name="G1-Survivor-Space",role="Worker"} 14680064 1695731141588
metrics_jvm_memory_pool_committed_Value{name="G1-Eden-Space",role="Worker"} 96468992 1695731141588
metrics_jvm_memory_pool_max_Value{name="G1-Survivor-Space",role="Worker"} -1 1695731141588
metrics_jvm_memory_pool_usage_Value{name="Compressed-Class-Space",role="Worker"} 0.004135251045227051 1695731141588
metrics_jvm_memory_pool_usage_Value{name="G1-Survivor-Space",role="Worker"} 1.0 1695731141588
metrics_jvm_memory_pool_max_Value{name="Code-Cache",role="Worker"} 251658240 1695731141588
metrics_jvm_memory_pool_init_Value{name="Compressed-Class-Space",role="Worker"} 0 1695731141589
metrics_jvm_memory_pool_usage_Value{name="G1-Eden-Space",role="Worker"} 0.34782608695652173 1695731141589
metrics_jvm_memory_pool_init_Value{name="Metaspace",role="Worker"} 0 1695731141589
metrics_jvm_memory_pool_max_Value{name="G1-Eden-Space",role="Worker"} -1 1695731141589
metrics_jvm_memory_pool_usage_Value{name="Code-Cache",role="Worker"} 0.04098917643229167 1695731141589
metrics_jvm_memory_pool_used_after_gc_Value{name="G1-Eden-Space",role="Worker"} 0 1695731141589
metrics_jvm_memory_pool_init_Value{name="Code-Cache",role="Worker"} 2555904 1695731141589
metrics_jvm_memory_pool_used_Value{name="G1-Survivor-Space",role="Worker"} 14680064 1695731141589
metrics_jvm_memory_pool_committed_Value{name="Compressed-Class-Space",role="Worker"} 4718592 1695731141589
metrics_jvm_memory_pool_used_Value{name="G1-Eden-Space",role="Worker"} 33554432 1695731141589
metrics_jvm_memory_pool_used_Value{name="G1-Old-Gen",role="Worker"} 34566688 1695731141589
metrics_jvm_memory_pool_usage_Value{name="G1-Old-Gen",role="Worker"} 0.004024092108011246 1695731141589
metrics_jvm_memory_pool_used_after_gc_Value{name="G1-Old-Gen",role="Worker"} 0 1695731141589
metrics_jvm_memory_pool_committed_Value{name="Metaspace",role="Worker"} 34865152 1695731141589
metrics_jvm_memory_pool_init_Value{name="G1-Old-Gen",role="Worker"} 1998585856 1695731141589
metrics_jvm_memory_pool_used_Value{name="Metaspace",role="Worker"} 32945840 1695731141589
metrics_jvm_memory_pool_max_Value{name="Compressed-Class-Space",role="Worker"} 1073741824 1695731141589

metrics_jvm_direct_count_Value{role="Worker"} 8 1695731141589
metrics_jvm_direct_capacity_Value{role="Worker"} 1036 1695731141589
metrics_jvm_direct_used_Value{role="Worker"} 1037 1695731141589
metrics_jvm_mapped_used_Value{role="Worker"} 0 1695731141589
metrics_jvm_mapped_capacity_Value{role="Worker"} 0 1695731141589
metrics_jvm_mapped_count_Value{role="Worker"} 0 1695731141589

metrics_jvm_thread_timed_waiting_count_Value{role="Worker"} 23 1695731141589
metrics_jvm_thread_deadlock_count_Value{role="Worker"} 0 1695731141589
metrics_jvm_thread_count_Value{role="Worker"} 78 1695731141589
metrics_jvm_thread_waiting_count_Value{role="Worker"} 45 1695731141589
metrics_jvm_thread_daemon_count_Value{role="Worker"} 75 1695731141589
metrics_jvm_thread_new_count_Value{role="Worker"} 0 1695731141589
metrics_jvm_thread_blocked_count_Value{role="Worker"} 0 1695731141590
metrics_jvm_thread_deadlocks_Value{role="Worker"} [] 1695731141590
metrics_jvm_thread_runnable_count_Value{role="Worker"} 10 1695731141590
metrics_jvm_thread_terminated_count_Value{role="Worker"} 0 1695731141590

### How was this patch tested?
UT and cluster test with g1, PS-Scavenge/PS-MarkSweep and ParNew/CMS

Closes #1939 from onebox-li/improve-jvm-metrics.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-28 10:18:37 +08:00
sychen
3e515c5d2e [CELEBORN-1009][DOC] CELEBORN_PREFER_JEMALLOC
### What changes were proposed in this pull request?

![image](https://github.com/apache/incubator-celeborn/assets/3898450/e7d9e93d-6e1c-469c-98f2-835840bb0973)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1944 from cxzl25/CELEBORN-1009.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Keyong Zhou <zhouky@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-27 23:14:27 +08:00
sychen
42f08ca21a [CELEBORN-985] Change default value of numConnectionsPerPeer to 1
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1943 from cxzl25/CELEBORN-985.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-27 22:50:23 +08:00
xleoken
83a92fd1f1
[MINOR] Remove unexpected $ symbol
### What changes were proposed in this pull request?

Remove unexpected $ symbol in README doc

### Why are the changes needed?

throw error
```
bash: export: `=/opt': not a valid identifier
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1942 from xleoken/patch.

Authored-by: xleoken <leo65535@163.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-27 19:51:26 +08:00
zhouyifan279
333db39713 [CELEBORN-954] Add documentation about reliable shuffle data storage
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
Yes. A new config was added in [README.md ](https://github.com/apache/incubator-celeborn/blob/main/README.md#spark-configuration).

### How was this patch tested?

Closes #1938 from zhouyifan279/reliable-storage-doc.

Authored-by: zhouyifan279 <zhouyifan279@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-27 00:39:14 +08:00
sunjunjie
aa9dfd0566 [CELEBORN-1005][BUG] Clean Expired App Dirs will delete the running a…
### What changes were proposed in this pull request?
When working on reading shuffle data, the file was accidentally deleted

`2023-09-22 16:32:36,810 [storage-scheduler] INFO  org.apache.celeborn.service.deploy.worker.storage.StorageManager[51]: Delete expired app dir /data8/rssdata/celeborn-worker/shuffle_data/application_1689848866482_12296544_1.
2023-09-22 16:32:36,810 [Disk-cleaner-/data8-6] DEBUG org.apache.celeborn.service.deploy.worker.storage.StorageManager[47]: Deleted expired shuffle file /data8/rssdata/celeborn-worker/shuffle_data/application_1689848866482_12296544_1/32/924-0-0.
2023-09-22 16:32:53,304 [fetch-server-11-31] DEBUG org.apache.celeborn.service.deploy.worker.FetchHandler[47]: Received chunk fetch request application_1689848866482_12296544_1-32 924-0-0 0 2147483647 get file info FileInfo{file=/data8/rssdata/celeborn-worker/shuffle_data/application_1689848866482_12296544_1/32/924-0-0, chunkOffsets=0,558, userIdentifier=`default`.`default`, partitionType=REDUCE}
java.io.FileNotFoundException: /data8/rssdata/celeborn-worker/shuffle_data/application_1689848866482_12296544_1/32/924-0-0 (No such file or directory)`

Because when cleaning up the directories of expired apps, the file directory is created first and then added to the fileInfos collection. As a result, when getting the shuffleKeySet, the running apps do not yet exist, causing the files to be mistakenly deleted.

https://issues.apache.org/jira/browse/CELEBORN-1005
### Why are the changes needed?
bugfix

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1937 from wilsonjie/CELEBORN-1005.

Lead-authored-by: sunjunjie <sunjunjie@zto.com>
Co-authored-by: junjie.sun <40379361+wilsonjie@users.noreply.github.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-25 23:20:48 +08:00
Mridul Muralidharan
3a41db360b
[CELEBORN-1006] Add support for Apache Hadoop 2.x in Celeborn build
Add support for Apache Hadoop 2.x in Celeborn build
Developers need to only specify their `hadoop.version`, and the build will pick the right profile internally based on the version to add the relevant dependencies.

[hadoop-client-api](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-api) and [hadoop-client-runtime](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client-runtime) were introduced in hadoop 3.x, while hadoop 2.x had [hadoop-client](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client)
Celeborn depends on the former, and so requires hadoop 3.x to build.

Apache Spark dropped support for Hadoop 2.x only in the recent v3.5 ([SPARK-42452](https://issues.apache.org/jira/browse/SPARK-42452)). Given this, we have case where deployments on supported platforms like Spark 3.4 and older running on 2.x hadoop, will need to pull in hadoop 3.x just for Celeborn.

This PR uses `hadoop-client` when `hadoop.version` is specified as 2.x - and preserves existing behavior when `hadoop.version` is 3.x

Note - while using `hadoop-client` in 3.x is an option, hadoop community recommendation is to rely on `hadoop-client-api`/`hadoop-client-runtime`, hence making an effort to leverage that as much as possible.

Adds support for using 2.x for hadoop.version

Three combinations were tested:

* Default, without overriding hadoop.version

Dependencies:
```
$ build/mvn dependency:list 2>&1 | grep hadoop | sort | uniq
[INFO]    org.apache.hadoop:hadoop-client-api:jar:3.2.4:compile
[INFO]    org.apache.hadoop:hadoop-client-runtime:jar:3.2.4:compile
```

Will update this section again based on test suite results (which are ongoing)

* Setting hadoop.version to newer 3.3.0 explicitly

Dependencies:
```
$ ARGS="-Pspark-3.1 -Dhadoop.version=3.3.0" ; build/mvn dependency:list $ARGS 2>&1 | grep hadoop | sort | uniq
[INFO]    org.apache.hadoop:hadoop-client-api:jar:3.3.0:compile
[INFO]    org.apache.hadoop:hadoop-client-runtime:jar:3.3.0:compile
```

* Setting hadoop.version to older 2.10.0

Dependencies:
```
$ ARGS="-Pspark-3.1 -Dhadoop.version=2.10.0" ; build/mvn dependency:list $ARGS 2>&1 | grep hadoop | grep compile | sort | uniq
[INFO]    org.apache.hadoop:hadoop-auth:jar:2.10.0:compile -- module hadoop.auth (auto)
[INFO]    org.apache.hadoop:hadoop-client:jar:2.10.0:compile -- module hadoop.client (auto)
[INFO]    org.apache.hadoop:hadoop-common:jar:2.10.0:compile -- module hadoop.common (auto)
[INFO]    org.apache.hadoop:hadoop-hdfs-client:jar:2.10.0:compile -- module hadoop.hdfs.client (auto)
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.10.0:compile -- module hadoop.mapreduce.client.app (auto)
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.10.0:compile -- module hadoop.mapreduce.client.common (auto)
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.10.0:compile -- module hadoop.mapreduce.client.core (auto)
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.10.0:compile
[INFO]    org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.10.0:compile -- module hadoop.mapreduce.client.shuffle (auto)
[INFO]    org.apache.hadoop:hadoop-yarn-api:jar:2.10.0:compile -- module hadoop.yarn.api (auto)
[INFO]    org.apache.hadoop:hadoop-yarn-common:jar:2.10.0:compile -- module hadoop.yarn.common (auto)
```

For each of the case above, build/test passes for each of the `ARGS`.

Closes #1936 from mridulm/main.

Authored-by: Mridul Muralidharan <mridulatgmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-25 20:15:02 +08:00
SteNicholas
2407cae43a [CELEBORN-770][FLINK] Convert BacklogAnnouncement, BufferStreamEnd, ReadAddCredit to PB
### What changes were proposed in this pull request?

`BacklogAnnouncement`, `BufferStreamEnd`, and `ReadAddCredit` should merge to transport messages to enhance celeborn's compatibility.

### Why are the changes needed?

1. Improves celeborn's transport flexibility to change RPC.
2. Makes Compatible with 0.2 client.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `TransportFrameDecoderWithBufferSupplierSuiteJ`

Closes #1905 from SteNicholas/CELEBORN-770.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-09-25 10:44:48 +08:00
Fu Chen
c775089c4b [CELEBORN-988][FOLLOWUP] Rename config key celeborn.worker.sortPartition.lazyRemovalOfOriginalFiles.enabled
### What changes were proposed in this pull request?
1. rename config key from `celeborn.worker.sortPartition.lazyRemovalOfOriginalFiles.enabled` to `celeborn.worker.sortPartition.eagerlyRemoveOriginalFiles.enabled`
2. make this config as an internal config

### Why are the changes needed?

make the config key more clearly

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1934 from cfmcgrady/celeborn-988-followup.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-24 22:28:32 +08:00
SteNicholas
55e85055b8
[CELEBORN-771][FLINK] Convert PushDataHandShake, RegionFinish, RegionStart to PB
### What changes were proposed in this pull request?

`PushDataHandShake`, `RegionFinish`, and `RegionStart` should merge to transport messages to enhance celeborn's compatibility.

### Why are the changes needed?

1. Improves celeborn's transport flexibility to change RPC.
2. Makes Compatible with 0.2 client.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- `RemoteShuffleOutputGateSuiteJ`

Closes #1910 from SteNicholas/CELEBORN-771.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-09-22 11:36:45 +08:00