celeborn/network.md at fe623888bf21dcecd662df3feafa3a19082e7ae3

Angerszhuuuu 92704c7d06 [CELEBORN-1051] Add isDynamic property for CelebornConf

### What changes were proposed in this pull request?
Since we support ConfigService, many configurations can be dynamic. This PR adds an `isDynamic` property to CelebornConf.

### Why are the changes needed?
Make the configuration doc clearer.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #2308 from AngersZhuuuu/CELEBORN-1051.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Angerszhuuuu <angers.zhu@gmail.com>

2024-02-20 14:20:44 +08:00

11 KiB


license
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


| Key | Default | isDynamic | Description | Since | Deprecated |
| --- | ------- | --------- | ----------- | ----- | ---------- |
| `celeborn.<module>.fetch.timeoutCheck.interval` | 5s | false | Interval for checking fetch data timeout. It only supports being set to `data`, since it applies to the shuffle client fetching data. | 0.3.0 |  |
| `celeborn.<module>.fetch.timeoutCheck.threads` | 4 | false | Number of threads for checking fetch data timeout. It only supports being set to `data`, since it applies to the shuffle client fetching data. | 0.3.0 |  |
| `celeborn.<module>.heartbeat.interval` | 60s | false | The heartbeat interval between worker and client. When set to `rpc`, it applies to the shuffle client. When set to `data`, it applies to the shuffle client pushing and fetching data. When set to `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. If you are using `celeborn.client.heartbeat.interval`, please use the new configs for each module according to your needs, or replace it with `celeborn.rpc.heartbeat.interval`, `celeborn.data.heartbeat.interval`, and `celeborn.replicate.heartbeat.interval`. | 0.3.0 | `celeborn.client.heartbeat.interval` |
| `celeborn.<module>.io.backLog` | 0 | false | Requested maximum length of the queue of incoming connections. Default 0 for no backlog. When set to `rpc`, it applies to the master or worker. When set to `push`, it applies to the worker receiving push data. When set to `replicate`, it applies to the replicate server of a worker replicating data to a peer worker. When set to `fetch`, it applies to the worker fetch server. |  |  |
| `celeborn.<module>.io.clientThreads` | 0 | false | Number of threads used in the client thread pool. Defaults to 0, which means 2x the number of cores. When set to `rpc`, it applies to the shuffle client. When set to `data`, it applies to the shuffle client pushing and fetching data. When set to `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. |  |  |
| `celeborn.<module>.io.connectTimeout` | `<value of celeborn.network.connect.timeout>` | false | Socket connect timeout. When set to `rpc`, it applies to the shuffle client. When set to `data`, it applies to the shuffle client pushing and fetching data. When set to `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. |  |  |
| `celeborn.<module>.io.connectionTimeout` | `<value of celeborn.network.timeout>` | false | Connection active timeout. When set to `rpc`, it applies to the shuffle client, master, or worker. When set to `data`, it applies to the shuffle client pushing and fetching data. When set to `push`, it applies to the worker receiving push data. When set to `replicate`, it applies to the replicate server or client of a worker replicating data to a peer worker. When set to `fetch`, it applies to the worker fetch server. |  |  |
| `celeborn.<module>.io.enableVerboseMetrics` | false | false | Whether to track detailed Netty memory metrics. If true, detailed metrics of the Netty PooledByteBufAllocator are collected; otherwise only general memory usage is tracked. |  |  |
| `celeborn.<module>.io.lazyFD` | true | false | Whether to initialize FileDescriptor lazily or not. If true, file descriptors are created only when data is going to be transferred. This can reduce the number of open files. When set to `fetch`, it applies to the worker fetch server. |  |  |
| `celeborn.<module>.io.maxRetries` | 3 | false | Max number of times we will retry on IO exceptions (such as connection timeouts) per request. If set to 0, we will not do any retries. When set to `push`, it applies to the Flink shuffle client pushing data. |  |  |
| `celeborn.<module>.io.mode` | NIO | false | Netty EventLoopGroup backend. Available options: NIO, EPOLL. |  |  |
| `celeborn.<module>.io.numConnectionsPerPeer` | 1 | false | Number of concurrent connections between two nodes. When set to `rpc`, it applies to the shuffle client. When set to `data`, it applies to the shuffle client pushing and fetching data. When set to `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. |  |  |
| `celeborn.<module>.io.preferDirectBufs` | true | false | If true, we will prefer allocating off-heap byte buffers within Netty. When set to `rpc`, it applies to the shuffle client, master, or worker. When set to `data`, it applies to the shuffle client pushing and fetching data. When set to `push`, it applies to the worker receiving push data. When set to `replicate`, it applies to the replicate server or client of a worker replicating data to a peer worker. When set to `fetch`, it applies to the worker fetch server. |  |  |
| `celeborn.<module>.io.receiveBuffer` | 0b | false | Receive buffer size (SO_RCVBUF). Note: the optimal size for the receive buffer and send buffer should be latency * network_bandwidth. Assuming latency = 1ms and network_bandwidth = 10Gbps, the buffer size should be ~1.25MB. When set to `rpc`, it applies to the shuffle client, master, or worker. When set to `data`, it applies to the shuffle client pushing and fetching data. When set to `push`, it applies to the worker receiving push data. When set to `replicate`, it applies to the replicate server or client of a worker replicating data to a peer worker. When set to `fetch`, it applies to the worker fetch server. | 0.2.0 |  |
| `celeborn.<module>.io.retryWait` | 5s | false | Time that we will wait in order to perform a retry after an IOException. Only relevant if `celeborn.<module>.io.maxRetries` > 0. When set to `data`, it applies to the shuffle client pushing and fetching data. When set to `push`, it applies to the Flink shuffle client pushing data. | 0.2.0 |  |
| `celeborn.<module>.io.saslTimeout` | 30s | false | Timeout for a single round trip of auth message exchange, in milliseconds. | 0.5.0 |  |
| `celeborn.<module>.io.sendBuffer` | 0b | false | Send buffer size (SO_SNDBUF). When set to `rpc`, it applies to the shuffle client, master, or worker. When set to `data`, it applies to the shuffle client pushing and fetching data. When set to `push`, it applies to the worker receiving push data. When set to `replicate`, it applies to the replicate server or client of a worker replicating data to a peer worker. When set to `fetch`, it applies to the worker fetch server. | 0.2.0 |  |
| `celeborn.<module>.io.serverThreads` | 0 | false | Number of threads used in the server thread pool. Defaults to 0, which means 2x the number of cores. When set to `rpc`, it applies to the master or worker. When set to `push`, it applies to the worker receiving push data. When set to `replicate`, it applies to the replicate server of a worker replicating data to a peer worker. When set to `fetch`, it applies to the worker fetch server. |  |  |
| `celeborn.<module>.push.timeoutCheck.interval` | 5s | false | Interval for checking push data timeout. When set to `data`, it applies to the shuffle client pushing data. When set to `push`, it applies to the Flink shuffle client pushing data. When set to `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. | 0.3.0 |  |
| `celeborn.<module>.push.timeoutCheck.threads` | 4 | false | Number of threads for checking push data timeout. When set to `data`, it applies to the shuffle client pushing data. When set to `push`, it applies to the Flink shuffle client pushing data. When set to `replicate`, it applies to the replicate client of a worker replicating data to a peer worker. | 0.3.0 |  |
| `celeborn.<role>.rpc.dispatcher.threads` | `<value of celeborn.rpc.dispatcher.threads>` | false | Number of threads of the message dispatcher event loop for roles. |  |  |
| `celeborn.io.maxDefaultNettyThreads` | 64 | false | Maximum default number of Netty threads. | 0.3.2 |  |
| `celeborn.network.bind.preferIpAddress` | true | false | When `true`, prefer to use the IP address, otherwise the FQDN. This configuration only takes effect when the bind hostname is not set explicitly; in that case, Celeborn will find the first non-loopback address to bind. | 0.3.0 |  |
| `celeborn.network.connect.timeout` | 10s | false | Default socket connect timeout. | 0.2.0 |  |
| `celeborn.network.memory.allocator.numArenas` | `<undefined>` | false | Number of arenas for the pooled memory allocator. Default value is `Runtime.getRuntime.availableProcessors`, min value is 2. | 0.3.0 |  |
| `celeborn.network.memory.allocator.verbose.metric` | false | false | Whether to enable verbose metrics for the pooled allocator. | 0.3.0 |  |
| `celeborn.network.timeout` | 240s | false | Default timeout for network operations. | 0.2.0 |  |
| `celeborn.port.maxRetries` | 1 | false | When the port is occupied, we will retry up to this many times. | 0.2.0 |  |
| `celeborn.rpc.askTimeout` | 60s | false | Timeout for RPC ask operations. It's recommended to set at least `240s` when `HDFS` is enabled in `celeborn.storage.activeTypes`. | 0.2.0 |  |
| `celeborn.rpc.connect.threads` | 64 | false |  | 0.2.0 |  |
| `celeborn.rpc.dispatcher.threads` | 0 | false | Number of threads of the message dispatcher event loop. Defaults to 0, which means the number of available cores. | 0.3.0 | `celeborn.rpc.dispatcher.numThreads` |
| `celeborn.rpc.io.threads` | `<undefined>` | false | Number of Netty IO threads of NettyRpcEnv to handle RPC requests. The default is the number of available processors at runtime. | 0.2.0 |  |
| `celeborn.rpc.lookupTimeout` | 30s | false | Timeout for RPC lookup operations. | 0.2.0 |  |
| `celeborn.shuffle.io.maxChunksBeingTransferred` | `<undefined>` | false | The max number of chunks allowed to be transferred at the same time on the shuffle service. Note that new incoming connections will be closed when the max number is hit. The client will retry according to the shuffle retry configs (see `celeborn.<module>.io.maxRetries` and `celeborn.<module>.io.retryWait`); if those limits are reached, the task will fail with a fetch failure. | 0.2.0 |  |
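
The sizing rule in the `receiveBuffer` description (buffer size = latency * network_bandwidth) and the `<module>` placeholder in the keys can be illustrated with a small sketch. The helper names `buffer_size_bytes` and `module_key` are hypothetical and not part of Celeborn. The printed `key value` layout and the `1250000b` value format are assumptions modeled on the `0b` default shown in the table, not a confirmed configuration syntax.

```python
def buffer_size_bytes(latency_seconds: float, bandwidth_bits_per_second: float) -> int:
    """Bandwidth-delay product in bytes: latency * bandwidth / 8 bits per byte."""
    return round(latency_seconds * bandwidth_bits_per_second / 8)


def module_key(module: str, suffix: str) -> str:
    """Expand the celeborn.<module>.<suffix> pattern for one module name (hypothetical helper)."""
    return f"celeborn.{module}.{suffix}"


# Example from the receiveBuffer description: 1 ms latency on a 10 Gbps link.
size = buffer_size_bytes(0.001, 10e9)
print(size)  # 1250000 bytes, i.e. ~1.25 MB

# Illustrative settings for the `data` module (shuffle client push and fetch).
# The "<key> <value>b" output format is an assumption, not a verified syntax.
for suffix in ("io.receiveBuffer", "io.sendBuffer"):
    print(f"{module_key('data', suffix)} {size}b")
```

The same expansion applies to the other module names listed in the descriptions (`rpc`, `push`, `replicate`, `fetch`), subject to which modules each key supports.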
