celeborn

Author	SHA1	Message	Date
zky.zhoukeyong	6a5e3ed794	[CELEBORN-812] Cleanup SendBufferPool if idle for long ### What changes were proposed in this pull request? Cleans up the pooled send buffers and push tasks if the SendBufferPool has been idle for more than `celeborn.client.push.sendbufferpool.expireTimeout`. ### Why are the changes needed? Before this PR the SendBufferPool will cache the send buffers and push tasks forever. If they are large and will not be reused in the future, it wastes memory and causes GC. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual tests. Closes #1735 from waitinfuture/812-1. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-20 00:34:55 +08:00
onebox-li	405b2801fa	[CELEBORN-810] Fix some typos and grammar ### What changes were proposed in this pull request? Fix some typos and grammar ### Why are the changes needed? Ditto ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? manually test Closes #1733 from onebox-li/fix-typo. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-19 18:35:38 +08:00
Cheng Pan	0db919403e	Revert "[CELEBORN-798] Add heartbeat from client to LifecycleManager to clean…" This reverts commit `e56a8a8bed`.	2023-07-19 15:08:45 +08:00
Fu Chen	7c6644b1a7	[CELEBORN-805] Immediate shutdown of server upon completion of unit test to prevent potential resource leakage ### What changes were proposed in this pull request? As title ### Why are the changes needed? Recently, while conducting the sbt build test, it came to my attention that certain resources such as ports and threads were not being released promptly. This pull request introduces a new method, `shutdown(graceful: Boolean)`, to the `Service` trait. When invoked by `MiniClusterFeature.shutdownMiniCluster`, it calls `worker.shutdown(graceful = false)`. This implementation aims to prevent possible memory leaks during CI processes. Before this PR the unit tests in the `client/common/master/service/worker` modules resulted in leaked ports. ``` $ jps 1138131 Jps 1130743 sbt-launch-1.9.0.jar $ netstat -lntp \| grep 1130743 (Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.) tcp 0 0 127.0.0.1:12345 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:41563 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:42905 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:44419 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:45025 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:44799 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:39053 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:39029 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:39475 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:40153 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:33051 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:33449 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:34073 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:35347 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:35971 0.0.0.0:* LISTEN 1130743/java tcp 0 0 0.0.0.0:36799 0.0.0.0:* LISTEN 1130743/java tcp 0 0 192.168.1.151:40775 0.0.0.0:* LISTEN 1130743/java tcp 0 0 192.168.1.151:44457 0.0.0.0:* LISTEN 1130743/java ``` After this PR: ``` $ jps 1114423 Jps 1107544 sbt-launch-1.9.0.jar $ netstat -lntp \| grep 1107544 (Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA Closes #1727 from cfmcgrady/shutdown. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-07-18 13:12:51 +08:00
zky.zhoukeyong	1109e2c8f4	[CELEBORN-803][FOLLOWUP] Make ```rpcAskTimeout``` default to 60s ### What changes were proposed in this pull request? As title. ### Why are the changes needed? Timeout of ```RpcEndpointRef.ask``` is controlled by ```celeborn.rpc.askTimeout```, so we also need to increase ```celeborn.rpc.askTimeout``` to extend the timeout of commit files. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1725 from waitinfuture/803-fu. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 23:53:52 +08:00
zky.zhoukeyong	9ec223edd7	[CELEBORN-803] Increase default timeout for commit files ### What changes were proposed in this pull request? As title. ### Why are the changes needed? In 0.2.1-incubating, commit files default timeout is ```NETWORK_TIMEOUT```, which is 240s. It's more reasonable because commit files costs relatively long time. In my testing with tough disks, 30s timeout with 2 retires is not enough. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1724 from waitinfuture/803. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 22:31:36 +08:00
zky.zhoukeyong	e56a8a8bed	[CELEBORN-798] Add heartbeat from client to LifecycleManager to clean… …up client ### What changes were proposed in this pull request? Add heartbeat from client to lifecycle manager. In this PR heartbeat request contains local shuffle ids from client, lifecycle manager checks with it's local set and returns ids it doesn't know. Upon receiving response, client calls ```unregisterShuffle``` for cleanup. ### Why are the changes needed? Before this PR, client side ```unregisterShuffle``` is never called. When running TPCDS 3T with spark thriftserver without DRA, I found the Executor's heap contains 1.6 million PartitionLocation objects (and StorageInfo): ![image](https://github.com/apache/incubator-celeborn/assets/948245/43658369-7763-4511-a5b0-9b3fbdf02005) After this PR, the number of PartitionLocation objects decreases to 275 thousands ![image](https://github.com/apache/incubator-celeborn/assets/948245/45f8f849-186d-4cad-83c8-64bd6d18debc) This heartbeat can be extended in the future for other purposes, i.e. reporting client's metrics. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1719 from waitinfuture/798. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 18:14:10 +08:00
zky.zhoukeyong	95119b1e4b	[CELEBORN-799][FOLLOWUP] Fix doc of `celeborn.client.push.maxReqsInFlight.total` …Flight.total``` ### What changes were proposed in this pull request? Refer to https://github.com/apache/incubator-celeborn/pull/1720#discussion_r1265092164 ### Why are the changes needed? ditto ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA. Closes #1723 from waitinfuture/799-fu. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 18:01:03 +08:00
zky.zhoukeyong	4b3a47c9db	[CELEBORN-799] Limit total inflight push requests ### What changes were proposed in this pull request? As title. ### Why are the changes needed? In case where worker instances is very large, say 1000, then before this PR total memory consumed by inflight requests is 64K * 1000 * ```celeborn.client.push.maxReqsInFlight(16)``` = 1G. This PR limits total inflight push requests, as 0.2.1-incubating does. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1720 from waitinfuture/799. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-17 16:17:24 +08:00
zky.zhoukeyong	a7bbbd05c4	[CELEBORN-797] Decrease writeTime metric sampling frequency to improve perf ### What changes were proposed in this pull request? 1. Decrease writeTime metric sampling frequency to improve perf 2. Set default value of ```celeborn.<module>.push.timeoutCheck.threads``` and ```celeborn.<module>.fetch.timeoutCheck.threads``` to 4 ### Why are the changes needed? Following are test cases case 1: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 15000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 1.1T data case 2: ```spark.sparkContext.parallelize(1 to 8000, 8000).flatMap( _ => (1 to 30000000).iterator.map(num => num)).repartition(8000).count``` // shuffle 2.2T data Following are e2e time of shuffle write stage \|\|Sort pusher before\|Sort pusher after\|Hash pusher before\|Hash pusher after\| \|----\|----\|----\|----\|-----\| \|case1\|4.4min\|4.1min\|4.4min\|3.9min\| \|case2\|9.1min\|8.4min\|9.7min\|8.5min\| ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA and manual test. Closes #1718 from waitinfuture/797. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-14 20:51:50 +08:00
mingji	a4687716d2	[CELEBORN-791] Remove slots allocation simulation in master and use active slots sent from worker's heartbeat ### What changes were proposed in this pull request? Master won't simulate slots allocations and use active slots sent from worker. ### Why are the changes needed? I have observed that a new worker might allocate more slots than other workers when using the round-robin slot allocation algorithm. There is a logic error in processing heartbeat from worker. It will update disk info's active slots to max(current disk info active slots, disk info sent from worker active slots). If I registered a huge shuffle, master will allocate more slots than a disk's max slots and mark them as unknown disk slots but worker will count the unknown disk slots as active slots and report it to the master. Then the slots release logic can not distinguish unknown slots from a number so the release will not decrease active slots properly. Due to the gap between work and master, so I think it's OK to remove slots allocation simulation from worker and use active slots from worker. Before this patch: <img width="928" alt="截屏2023-07-12 16 51 15" src="https://github.com/apache/incubator-celeborn/assets/4150993/9c8a46d9-26a8-42f5-a956-938273277c9b"> After this patch: <img width="509" alt="截屏2023-07-12 16 25 52" src="https://github.com/apache/incubator-celeborn/assets/4150993/c49b3d91-14ea-4eb8-9b71-9aab73541faf"> ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? UT and cluster. Closes #1710 from FMX/CELEBORN-791. Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-14 20:40:55 +08:00
zky.zhoukeyong	dcf6be29d8	[CELEBORN-789] Increase default value of flushBuffer's max components ### What changes were proposed in this pull request? Set default value of ```celeborn.worker.push.compositeBuffer.maxComponents``` to 256, to be aligned with 0.2.1-incubating version. ### Why are the changes needed? Default 16 is too small, and causes ~~severe GC~~ and CPU high load. <img width="1719" alt="image" src="https://github.com/apache/incubator-celeborn/assets/26535726/9ab9675e-c19e-44f1-af46-90c29dc4df75"> ### Does this PR introduce _any_ user-facing change? No, it's internal config. ### How was this patch tested? Passes GA. Closes #1707 from waitinfuture/789. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-12 20:18:48 +08:00
Angerszhuuuu	1642090f9f	[CELEBORN-781] Refactor RPC message type name ### What changes were proposed in this pull request? After https://github.com/apache/incubator-celeborn/pull/1658 merged, we can format the message type now. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1696 from AngersZhuuuu/CELEBORN-731. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-12 14:16:19 +08:00
Angerszhuuuu	b4dfb0352b	[CELEBORN-733] Clean unused GetBlacklist & GetBlacklistResponse ### What changes were proposed in this pull request? Clean unused GetBlacklist & GetBlacklistResponse ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1656 from AngersZhuuuu/CELEBORN-733. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-11 15:58:20 +08:00
caojiaqing	d64e0091f1	[CELEBORN-785] Add worker side partition hard split threshold ### What changes were proposed in this pull request? Add a configuration `celeborn.worker.shuffle.partitionSplit.max` to ensure that, in soft mode, individual partition files are limited to a size smaller than the configured value ### Why are the changes needed? In soft mode, there may be situations where individual partition files are exceptionally large, which can result in excessively long sort times in skewed scenarios. ### Does this PR introduce _any_ user-facing change? `celeborn.worker.shuffle.partitionSplit.max` defalut value 2g ### How was this patch tested? none Closes #1701 from JQ-Cao/785. Authored-by: caojiaqing <caojiaqing@bilibili.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-11 14:14:41 +08:00
zky.zhoukeyong	7a47fae230	[CELEBORN-786] Change default flush threads ### What changes were proposed in this pull request? This PR changes default values of the following configs: \|config\|previous default value\|new default value\| \|----\|----\|----\| \|celeborn.worker.flusher.threads\|2\|16\| \|celeborn.worker.flusher.ssd.threads\|8\|16\| ### Why are the changes needed? If disk type is not specified, ```celeborn.worker.flusher.threads``` will be used. Recently many users use SSD for Celeborn workers without specifying disk type, and 2 flush threads is far from leveraging the power of SSD. ### Does this PR introduce _any_ user-facing change? Yes, default configs are changed. ### How was this patch tested? Passes GA. Closes #1703 from waitinfuture/786. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-11 13:09:29 +08:00
Angerszhuuuu	9f09ac6ce9	[CELEBORN-780] Change SPARK_SHUFFLE_FORCE_FALLBACK_PARTITION_THRESHOLD default to Int.MaxValue since slot's is not a bottleneck ### What changes were proposed in this pull request? Now slots is not a bottleneck, change SPARK_SHUFFLE_FORCE_FALLBACK_PARTITION_THRESHOLD default value to Int.MaxValue. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1695 from AngersZhuuuu/CELEBORN-780. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-10 18:50:10 +08:00
zky.zhoukeyong	334f0e5ba4	[CELEBORN-782] Make max components configurable for FileWriter#flushBuffer ### What changes were proposed in this pull request? Make max components configurable for FileWriter#flushBuffer. ### Why are the changes needed? When max components of ```CompositeByteBuf``` is too big (hard coded 256 before this PR), netty's offheap memory usage will be several times bigger than true usage: ``` Direct memory usage: 1044.0 MiB/4.0 GiB, disk buffer size: 255.9 MiB ``` When set to 1, netty's memory usage will be very close to disk buffer: ``` Direct memory usage: 376.0 MiB/4.0 GiB, disk buffer size: 353.0 MiB ``` but when it set too low, for example 1, performance might decrease, especially for sort pusher: ``` max components: 1 vs. 16 shuffle write time: 34s vs. 30s ``` For hash pusher, difference is not so big: ``` max components: 1 vs. 8 shuffle write time: 25s vs. 23s ``` This PR makes the parameter configurable. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Passes GA, and manual test. Closes #1697 from waitinfuture/782. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Keyong Zhou <waitinfuture@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-10 18:00:07 +08:00
Cheng Pan	a556b02bc1	[CELEBORN-747][FOLLOWUP] Fix nanoDurationToString and add UT ### What changes were proposed in this pull request? Sorry, the last added `nanoDurationToString` is not correct, this PR fixes it and adds UT to verify. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No, unreleased change. ### How was this patch tested? UT. Closes #1688 from pan3793/CELEBORN-747-followup2. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-07-06 19:34:35 +08:00
Cheng Pan	ed035d7ab0	[CELEBORN-747][FOLLOWUP] avgFlushTime and avgFetchTime are in nano seconds <!-- Thanks for sending a pull request! Here are some tips for you: - Make sure the PR title start w/ a JIRA ticket, e.g. '[CELEBORN-XXXX] Your PR title ...'. - Be sure to keep the PR description updated to reflect all changes. - Please write your PR title to summarize what this PR proposes. - If possible, provide a concise example to reproduce the issue for a faster review. --> ### What changes were proposed in this pull request? `avgFlushTime` and `avgFetchTime` are in nano seconds, it was accidentally formatted by `msDurationToString` and caused unreasonable logs. ``` usableSpace: 1602.1 GiB, avgFlushTime: 1.35 h, avgFetchTime: 1187.66 h ``` ### Why are the changes needed? Fix logs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? UT is updated Closes #1687 from pan3793/CELEBORN-747-followup. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-07-06 17:59:49 +08:00
Cheng Pan	4f8e72f217	[CELEBORN-774] Pullout celeborn.rpc.dispatcher.threads to CelebornConf ### What changes were proposed in this pull request? Pullout hardcoded `celeborn.rpc.dispatcher.numThreads` to `CelebornConf` and rename it to `celeborn.rpc.dispatcher.threads` to align with existing configuration style ### Why are the changes needed? Pullout inline configuration to `CelebornConf`, and expose it in configuration docs ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #1684 from pan3793/CELEBORN-774. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-07-06 16:23:32 +08:00
zky.zhoukeyong	09881f5cff	[CELEBORN-769] Change default value of celeborn.client.push.maxReqsInFlight to 16 …Flight to 16 ### What changes were proposed in this pull request? Change default value of celeborn.client.push.maxReqsInFlight to 16. ### Why are the changes needed? Previous value 4 is too small, 16 is more reasonable. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #1683 from waitinfuture/769. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-06 10:22:06 +08:00
mingji	d0ecf83fec	[CELEBORN-764] Fix celeborn on HDFS might clean using app directories ### What changes were proposed in this pull request? Make Celeborn leader clean expired app dirs on HDFS when an application is Lost. ### Why are the changes needed? If Celeborn is working on HDFS, the storage manager starts and cleans expired app directories, and the newly created worker will want to delete any unknown app directories. This will cause using app directories to be deleted unexpectedly. ### Does this PR introduce _any_ user-facing change? NO. ### How was this patch tested? UT and cluster. Closes #1678 from FMX/CELEBORN-764. Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-05 23:11:50 +08:00
zky.zhoukeyong	4300835363	[CELEBORN-768] Change default config values for batch rpcs and netty … …memory allocator ### What changes were proposed in this pull request? Changes the following configs' default values \| config \| previous value \| current value \| \| ------------- \| ------------- \| ------------- \| \| celeborn.network.memory.allocator.share \| false \| true \| \| celeborn.client.shuffle.batchHandleChangePartition.enabled \| false \| true \| \| celeborn.client.shuffle.batchHandleCommitPartition.enabled \| false \| true \| ### Why are the changes needed? In my test, when graceful shutdown is enabled but ```celeborn.client.shuffle.batchHandleChangePartition.enabled``` and ```celeborn.client.shuffle.batchHandleCommitPartition.enabled``` disabled, the worker takes much longer to stop than the two configs enabled. In another test where worker size is quite small(2 cores 4 G) and replication is on, if shared allocator is disabled, the netty's onTrim fails to release memory, and further causes push data timeout. ### Does this PR introduce _any_ user-facing change? No, these conifgs are introduces from 0.3.0. ### How was this patch tested? Passes GA. Closes #1682 from waitinfuture/768. Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-05 18:16:41 +08:00
Fu Chen	3af5c231c7	[CELEBORN-767][DOC] Update the docs of `celeborn.client.spark.push.sort.memory.threshold` ### What changes were proposed in this pull request? As title ### Why are the changes needed? To clarify the usage of conf `celeborn.client.spark.push.sort.memory.threshold` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA Closes #1680 from cfmcgrady/docs. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-05 18:07:09 +08:00
Angerszhuuuu	693172d0bd	[CELEBORN-751] Rename remain rss related class name and filenames etc ### What changes were proposed in this pull request? Rename remain rss related class name and filenames etc... ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1664 from AngersZhuuuu/CELEBORN-751. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-07-04 10:20:08 +08:00
Cheng Pan	a1be02b4fa	[CELEBORN-757] Improve metrics method signature and code style ### What changes were proposed in this pull request? - gauge method definition improvement. i.e. before ``` def addGauge[T](name: String, f: Unit => T, labels: Map[String, String]) ``` after ``` def addGauge[T](name: String, labels: Map[String, String])(f: () => T) ``` which improves the caller-side code style ``` addGauge(name, labels) { () => ... } ``` - remove unnecessary Java/Scala collection conversion. Since Scala 2.11 does not support SAM, the extra implicit function is required. - leverage Logging to defer message evaluation - UPPER_CASE string constants ### Why are the changes needed? Improve code quality and performance(maybe) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #1670 from pan3793/CELEBORN-757. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-03 11:56:43 +08:00
Angerszhuuuu	7880c52fff	[CELEBORN-745] Match TransportMessage type use number instead of enum ### What changes were proposed in this pull request? Match TransportMessage type use number not enum to support change MessageType name，after this pr, then we can change the MessageType name. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1658 from AngersZhuuuu/CELEBORN-745. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: Shuang <lvshuang.tb@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-07-01 18:50:02 +08:00
xiyu.zk	381165d4e7	[CELEBORN-755] Support disable shuffle compression ### What changes were proposed in this pull request? Support to decide whether to compress shuffle data through configuration. ### Why are the changes needed? Currently, Celeborn compresses all shuffle data, but for example, the shuffle data of Gluten has already been compressed. In this case, no additional compression is required. Therefore, configuration needs to be provided for users to decide whether to use Celeborn’s compression according to the actual situation. ### Does this PR introduce _any_ user-facing change? no. Closes #1669 from kerwin-zk/celeborn-755. Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-07-01 00:03:50 +08:00
Fu Chen	047e90b17b	[CELEBORN-756][WORKER] Refactor `PushDataHandler` class to utilize `while` loop ### What changes were proposed in this pull request? as title ### Why are the changes needed? per https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex, use `while` loop for performance-sensitive code worker's flame graph before: ![截屏2023-06-30 下午5 58 02](https://github.com/apache/incubator-celeborn/assets/8537877/28c199b6-a29b-4501-8064-e0f2ddb2a8b9) after: ![截屏2023-06-30 下午5 58 18](https://github.com/apache/incubator-celeborn/assets/8537877/c6134959-5f78-436b-aa29-a78882b09e84) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA Closes #1668 from cfmcgrady/while-loop-2. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-30 18:13:53 +08:00
Angerszhuuuu	5c7ecb8302	[CELEBORN-754][IMPORTANT] Provide a new SparkShuffleManager to replace RssShuffleManager in the future ### What changes were proposed in this pull request? Provide a new SparkShuffleManager to replace RssShuffleManager in the future ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1667 from AngersZhuuuu/CELEBORN-754. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-30 17:27:33 +08:00
Cheng Pan	df3fc194fb	[CELEBORN-744] Add Benchmark framework and ComputeIfAbsentBenchmark ### What changes were proposed in this pull request? The benchmark shows that `computeIfAbsent` still has better performance on simple case ``` ================================================================================================ HashMap ================================================================================================ OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Mac OS X 13.4.1 Apple M1 Pro HashMap: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ putIfAbsent 701 702 0 95.7 10.4 1.0X computeIfAbsent 534 535 1 125.6 8.0 1.3X ================================================================================================ ConcurrentHashMap ================================================================================================ OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Mac OS X 13.4.1 Apple M1 Pro ConcurrentHashMap: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ putIfAbsent 712 716 3 94.2 10.6 1.0X computeIfAbsent 702 705 2 95.6 10.5 1.0X ``` ### Why are the changes needed? Introduce a Benchmark framework for future performance sensitive case measurement. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #1657 from pan3793/CELEBORN-744. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 20:19:30 +08:00
Fu Chen	adbd38a926	[CELEBORN-726][FOLLOWUP] Update data replication terminology from `master/slave` to `primary/replica` in the codebase ### What changes were proposed in this pull request? As title ### Why are the changes needed? In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #1639 from cfmcgrady/primary-replica. Lead-authored-by: Fu Chen <cfmcgrady@gmail.com> Co-authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 17:07:26 +08:00
Angerszhuuuu	1fd8833756	[CELEBORN-748] Rename RssHARetryClient to MasterClient ### What changes were proposed in this pull request? Rename RssHARetryClient to MasterClient ### Why are the changes needed? Code refactor ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #1661 from AngersZhuuuu/CELEBORN-748. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 16:47:15 +08:00
Cheng Pan	11569689be	[CELEBORN-747] Print time/bytes in human-readable format ### What changes were proposed in this pull request? Print time/bytes in human-readable format ### Why are the changes needed? Make logs readable ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #1659 from pan3793/minor. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 16:46:12 +08:00
cchung100m	4e1e97b2c2	[CELEBORN-478] Replace putIfAbsent with computeIfAbsent in ConcurrentHashMap in Java code ### What changes were proposed in this pull request? Replace `putIfAbsent` with computeIfAbsent in ConcurrentHashMap ### Why are the changes needed? The invoking of `putIfAbsent` will always call its value if it's a time-consuming operation. So we'd better replace `putIfAbsent` with `computeIfAbsent` in some critical paths. ### Does this PR introduce _any_ user-facing change? No, it does not affect the user-facing API ### How was this patch tested? current UT Closes #1567 from cchung100m/CELEBORN-478. Lead-authored-by: cchung100m <cchung100m@cs.ccu.edu.tw> Co-authored-by: Cheng Pan <chengpan@apache.org> Co-authored-by: Neo Chien <cchung100m@cs.ccu.edu.tw> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 16:44:18 +08:00
Angerszhuuuu	1a53db22ce	[CELEBORN-739] Rename HeartbeatResponse to HeartbeatFromWorkerResponse ### What changes were proposed in this pull request? Rename HeartbeatResponse to HeartbeatFromWorkerResponse ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1651 from AngersZhuuuu/CELEBORN-739. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 13:08:08 +08:00
Cheng Pan	c33aabfa37	[CELEBORN-740] Remove usage of AccessController ### What changes were proposed in this pull request? Remove usage of deprecated `java.security.AccessController` ### Why are the changes needed? `AccessController` is deprecated for removal since Java 17 https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/security/AccessController.html Recover building for Java 17 ``` [INFO] compiling 72 Scala sources and 209 Java sources to /home/runner/work/incubator-celeborn/incubator-celeborn/common/target/classes ... Error: /home/runner/work/incubator-celeborn/incubator-celeborn/common/src/main/scala/org/apache/celeborn/common/serializer/SerializationDebugger.scala:71: class AccessController in package security is deprecated Error: [ERROR] one error found ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ``` scala> System.getProperty("java.version") res0: String = 1.8.0_332 scala> System.getProperty("sun.io.serialization.extendedDebugInfo") res1: String = null scala> java.lang.Boolean.getBoolean("sun.io.serialization.extendedDebugInfo") res2: Boolean = false ``` Closes #1652 from pan3793/CELEBORN-740. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-29 10:21:02 +08:00
Fu Chen	17c1e01874	[CELEBORN-726] Update data replication terminology from `master/slave` to `primary/replica` for configurations and metrics ### What changes were proposed in this pull request? This pull PR is an integral component of #1639 . It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC. ### Why are the changes needed? In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests. Closes #1650 from cfmcgrady/primary-replica-metrics. Authored-by: Fu Chen <cfmcgrady@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-29 09:47:02 +08:00
Angerszhuuuu	4c4e18b0d6	[CELEBORN-735] Remove unused RPC GetWorkerInfo & GetWorkerInfosResponse ### What changes were proposed in this pull request? Remove unused RPC GetWorkerInfo & GetWorkerInfosResponse ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1647 from AngersZhuuuu/CELEBORN-735. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-28 20:17:56 +08:00
Angerszhuuuu	a672db719a	[CELEBORN-734] Remove unused RPC ReregisterWorkerResonse ### What changes were proposed in this pull request? Remove unused RPC ReregisterWorkerResonse ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1646 from AngersZhuuuu/CELEBORN-734. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-28 19:59:53 +08:00
Angerszhuuuu	590198ecea	[CELEBORN-666][FOLLOWUP] Rename all RPC blacklist fields ### What changes were proposed in this pull request? In this pr, we rename all RPC blacklist fields, it won't have have compatibility issues. For RPC `GetBlacklist` and `GetBlacklistResponse` we won't change it, since it won't be used in next release, so we can remove these two RPC in next release. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1643 from AngersZhuuuu/CELEBORN-666-RPC. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-28 19:49:44 +08:00
Angerszhuuuu	ad13b04f2e	[CELEBORN-732] Remove unused RPC ThreadDump & ThreadDumpResponse ### What changes were proposed in this pull request? Remove unused RPC ThreadDump & ThreadDumpResponse ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1645 from AngersZhuuuu/CELEBORN-732. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-28 19:43:39 +08:00
Angerszhuuuu	63f22342e9	[CELEBORN-730] Remove unused SlaveLostResponse ### What changes were proposed in this pull request? Remove unused SlaveLostResponse ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1644 from AngersZhuuuu/CELEBORN-730. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-28 19:35:23 +08:00
onebox-li	1b74d85fb1	[CELEBORN-725][MINOR] Refine congestion code ### What changes were proposed in this pull request? Refine the congestion relevant code/log/comments ### Why are the changes needed? ditto ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? manually test Closes #1637 from onebox-li/improve-congestion. Authored-by: onebox-li <lyh-36@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 18:31:40 +08:00
Cheng Pan	3d7c1fa0ae	[CELEBORN-729] Fix typo PbRegisterShuffle#numMappers ### What changes were proposed in this pull request? Fix typo `numMapppers`, should be `numMappers` ### Why are the changes needed? Fix typo ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Protobuf serde depends on message field seq no, not name. Closes #1642 from pan3793/CELEBORN-729. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 18:28:34 +08:00
Angerszhuuuu	3985a5cbd7	[CELEBORN-666][FOLLOWUP] Unify all blacklist related code and comment ### What changes were proposed in this pull request? Unify all blacklist related code and comment ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1638 from AngersZhuuuu/CELEBORN-666-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>	2023-06-28 16:28:03 +08:00
zhongqiang.czq	374d735ae5	[CELEBORN-724] Fix the compatibility of HeartbeatFromApplicationRespo… …nse with lower versions ### What changes were proposed in this pull request? The master side will check HeartbeatFromApplication's reply field. if reply is true then it replies HeartbeatFromApplicationResponse otherwise OneWayMessageResponse. The reply field is default false before the version 0.2.1, so master can be compatible with older client version ### Why are the changes needed? Before the version `0.2.1`, the response of HeartbeatFromApplication is` OneWayMessageResponse`, but from `0.3.0`, the response of HeartbeatFromApplication is modified to `HeartbeatFromApplicationResponse`. if the version of `client side `is `0.2.1` and the version of `server side is 0.3.0`, the `compatiblity issue `will occur. The following compatiblity error will be printted. ``` java java.io.InvalidObjectException: enum constant HEARTBEAT_FROM_APPLICATION_RESPONSE does not exist in class org.apache.celeborn.common.protocol.MessageType at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:2157) ~[?:1.8.0_362] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1662) ~[?:1.8.0_362] at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2430) ~[?:1.8.0_362] at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2354) ~[?:1.8.0_362] at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2212) ~[?:1.8.0_362] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1668) ~[?:1.8.0_362] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502) ~[?:1.8.0_362] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460) ~[?:1.8.0_362] at org.apache.celeborn.common.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?] ``` ``` java Caused by: java.lang.ClassCastException: Cannot cast org.apache.celeborn.common.protocol.message.ControlMessages$HeartbeatFromApplicationResponse to org.apache.celeborn.common.protocol.message.ControlMessages$OneWayMessageResponse$ at java.lang.Class.cast(Class.java:3369) ~[?:1.8.0_362] at scala.concurrent.Future.$anonfun$mapTo$1(Future.scala:500) ~[scala-library-2.12.15.jar:?] at scala.util.Success.$anonfun$map$1(Try.scala:255) ~[scala-library-2.12.15.jar:?] at scala.util.Success.map(Try.scala:213) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Future.$anonfun$map$1(Future.scala:292) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:67) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:82) ~[scala-library-2.12.15.jar:?] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:59) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:875) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:110) ~[scala-library-2.12.15.jar:?] at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Promise.trySuccess(Promise.scala:94) ~[scala-library-2.12.15.jar:?] at scala.concurrent.Promise.trySuccess$(Promise.scala:94) ~[scala-library-2.12.15.jar:?] at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187) ~[scala-library-2.12.15.jar:?] at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:218) ~[celeborn-client-spark-3-shaded_2.12-0.2.1-incubating.jar:?] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The pr is tested manually and the testing process is as follows: 1. server side is deploy using the code of latest branch-0.3. 2. spark client is deploy the version of 0.2.1, then run spark-sql to execute 3 tpcds queries( query1.sql/querey2/quere3.sql whose datasize is 1T), finnally verify that the queries are executed successfully and no above compatiblity error printted 3. spark client is deploy the version of 0.3.0, then run spark-sql to execute 3 tpcds queries( query1.sql/querey2/quere3.sql whose datasize is 1T), finnally verify that the queries are executed successfully and no above compatiblity error printted This patch had conflicts when merged, resolved by Committer: Cheng Pan <chengpan@apache.org> Closes #1635 from zhongqiangczq/heartbeat2. Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com> Signed-off-by: Cheng Pan <chengpan@apache.org>	2023-06-28 16:04:18 +08:00
Angerszhuuuu	33cf343d20	[CELEBORN-666][REFACTOR] Unify exclude and blacklist related configuration ### What changes were proposed in this pull request? Unify exclude and blacklist related configuration ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes #1633 from AngersZhuuuu/CELEBORN-666-NEW. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>	2023-06-28 10:59:58 +08:00
zky.zhoukeyong	57b0e815cf	[CELEBORN-656] Batch revive RPCs in client to avoid too many requests ### What changes were proposed in this pull request? This PR batches revive requests and periodically send to LifecycleManager to reduce number or RPC requests. To be more detailed. This PR changes Revive message to support multiple unique partitions, and also passes a set unique mapIds for checking MapEnd. Each time ShuffleClientImpl wants to revive, it adds a ReviveRquest to ReviveManager and wait for result. ReviveManager batches revive requests and periodically send to LifecycleManager (deduplicated by partitionId). LifecycleManager constructs ChangeLocationsCallContext and after all locations are notified, it replies to ShuffleClientImpl. ### Why are the changes needed? In my test 3T TPCDS q23a with 3 Celeborn workers, when kill a worker, the LifecycleManger will receive 4.8w Revive requests: ``` [emr-usermaster-1-1 logs]$ cat spark-emr-user-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master-1-1.c-fa08904e94c028d1.out.1 \|grep -i revive \|wc -l 64364 ``` After this PR, number of ReviveBatch requests reduces to 708: ``` [emr-usermaster-1-1 logs]$ cat spark-emr-user-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-master-1-1.c-fa08904e94c028d1.out \|grep -i revive \|wc -l 2573 ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. I have tested: 1. Disable graceful shutdown, kill one worker, job succeeds 2. Disable graceful shutdown, kill two workers successively, job fails as expected 3. Enable graceful shutdown, restart two workers successively, job succeeds 4. Enable graceful shutdown, restart two workers successively, then kill the third one, job succeeds Closes #1588 from waitinfuture/656-2. Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com> Co-authored-by: Keyong Zhou <zhouky@apache.org> Co-authored-by: Keyong Zhou <waitinfuture@gmail.com> Signed-off-by: Shuang <lvshuang.tb@gmail.com>	2023-06-27 22:11:04 +08:00

1 2 3 4 5 ...

493 Commits