celeborn/docs/configuration/network.md
Chandni Singh c8b5384baf [CELEBORN-1107] Make the max default number of netty threads configurable
### What changes were proposed in this pull request?
This change makes the maximum default number of Netty threads configurable. Previously, this value was hardcoded to 64, which could be small for certain environments. While it's possible to configure the number of Netty server and client threads individually for each module, providing an option to increase the default value offers greater convenience.
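
To illustrate how such a cap could interact with the per-module thread settings, here is a minimal sketch. The helper name `resolve_netty_threads` is hypothetical, and the resolution rule is an assumption pieced together from this page (an explicit per-module value wins; otherwise the documented default of 2x cores is capped by `celeborn.io.maxDefaultNettyThreads`). It is not taken from the Celeborn source.

```python
import os


def resolve_netty_threads(configured: int, cores: int, max_default: int = 64) -> int:
    """Hypothetical helper: pick a Netty thread count.

    Assumption (not confirmed by the Celeborn source): an explicit
    per-module value is used as-is; a value of 0 falls back to
    2x the core count, capped at max_default.
    """
    if configured > 0:
        return configured
    return min(2 * cores, max_default)


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    # With the cap left at 64, large machines stop scaling at 64 threads.
    print(resolve_netty_threads(0, cores))
    # Raising the cap lets the default grow with the core count.
    print(resolve_netty_threads(0, cores, max_default=128))
```

Under these assumptions, a 96-core host would get 64 default threads with the old hardcoded limit, and raising the cap would lift that without touching each module's `clientThreads` and `serverThreads` individually.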

### Why are the changes needed?
The change offers convenience.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a UT

Closes #2065 from otterc/CELEBORN-1107.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-11-03 13:18:44 +08:00

license
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

| Key | Default | Description | Since |
| --- | ------- | ----------- | ----- |
| `celeborn.<module>.fetch.timeoutCheck.interval` | 5s | Interval for checking fetch data timeout. Only `data` is supported as the module, since this applies to the shuffle client fetching data; configure it on the client side. | 0.3.0 |
| `celeborn.<module>.fetch.timeoutCheck.threads` | 4 | Number of threads for checking fetch data timeout. Only `data` is supported as the module, since this applies to the shuffle client fetching data; configure it on the client side. | 0.3.0 |
| `celeborn.<module>.heartbeat.interval` | 60s | The heartbeat interval between worker and client. When the module is `data`, it applies to shuffle client push and fetch and should be configured on the client side. When the module is `replicate`, it applies to a worker replicating data to its peer worker and should be configured on the worker side. | 0.3.0 |
| `celeborn.<module>.io.backLog` | 0 | Requested maximum length of the queue of incoming connections. Default 0 for no backlog. |  |
| `celeborn.<module>.io.clientThreads` | 0 | Number of threads used in the client thread pool. Defaults to 0, which means 2x the number of cores. |  |
| `celeborn.<module>.io.connectTimeout` | `<value of celeborn.network.connect.timeout>` | Socket connect timeout. |  |
| `celeborn.<module>.io.connectionTimeout` | `<value of celeborn.network.timeout>` | Connection active timeout. |  |
| `celeborn.<module>.io.enableVerboseMetrics` | false | Whether to track detailed Netty memory metrics. If true, detailed metrics of the Netty PooledByteBufAllocator are collected; otherwise only general memory usage is tracked. |  |
| `celeborn.<module>.io.lazyFD` | true | Whether to initialize FileDescriptor lazily. If true, file descriptors are created only when data is about to be transferred, which can reduce the number of open files. |  |
| `celeborn.<module>.io.maxRetries` | 3 | Maximum number of retries per request on IO exceptions (such as connection timeouts). If set to 0, no retries are performed. |  |
| `celeborn.<module>.io.mode` | NIO | Netty EventLoopGroup backend. Available options: NIO, EPOLL. |  |
| `celeborn.<module>.io.numConnectionsPerPeer` | 1 | Number of concurrent connections between two nodes. |  |
| `celeborn.<module>.io.preferDirectBufs` | true | If true, prefer allocating off-heap byte buffers within Netty. |  |
| `celeborn.<module>.io.receiveBuffer` | 0b | Receive buffer size (SO_RCVBUF). Note: the optimal size for the receive and send buffers is latency * network_bandwidth. Assuming latency = 1ms and network_bandwidth = 10Gbps, the buffer size should be ~1.25MB. | 0.2.0 |
| `celeborn.<module>.io.retryWait` | 5s | Time to wait before retrying after an IOException. Only relevant if `celeborn.<module>.io.maxRetries` > 0. | 0.2.0 |
| `celeborn.<module>.io.sendBuffer` | 0b | Send buffer size (SO_SNDBUF). | 0.2.0 |
| `celeborn.<module>.io.serverThreads` | 0 | Number of threads used in the server thread pool. Defaults to 0, which means 2x the number of cores. |  |
| `celeborn.<module>.push.timeoutCheck.interval` | 5s | Interval for checking push data timeout. When the module is `data`, it applies to shuffle client push and should be configured on the client side. When the module is `replicate`, it applies to a worker replicating data to its peer worker and should be configured on the worker side. | 0.3.0 |
| `celeborn.<module>.push.timeoutCheck.threads` | 4 | Number of threads for checking push data timeout. When the module is `data`, it applies to shuffle client push and should be configured on the client side. When the module is `replicate`, it applies to a worker replicating data to its peer worker and should be configured on the worker side. | 0.3.0 |
| `celeborn.<role>.rpc.dispatcher.threads` | `<value of celeborn.rpc.dispatcher.threads>` | Number of threads in the message dispatcher event loop for the given role. |  |
| `celeborn.io.maxDefaultNettyThreads` | 64 | Maximum default number of Netty threads. | 0.3.2 |
| `celeborn.network.bind.preferIpAddress` | true | When true, prefer the IP address; otherwise use the FQDN. This configuration only takes effect when the bind hostname is not set explicitly, in which case Celeborn binds to the first non-loopback address it finds. | 0.3.0 |
| `celeborn.network.connect.timeout` | 10s | Default socket connect timeout. | 0.2.0 |
| `celeborn.network.memory.allocator.numArenas` | `<undefined>` | Number of arenas for the pooled memory allocator. The default is `Runtime.getRuntime.availableProcessors`; the minimum is 2. | 0.3.0 |
| `celeborn.network.memory.allocator.verbose.metric` | false | Whether to enable verbose metrics for the pooled allocator. | 0.3.0 |
| `celeborn.network.timeout` | 240s | Default timeout for network operations. | 0.2.0 |
| `celeborn.port.maxRetries` | 1 | Maximum number of retries when a port is already occupied. | 0.2.0 |
| `celeborn.rpc.askTimeout` | 60s | Timeout for RPC ask operations. At least 240s is recommended when HDFS is enabled in `celeborn.storage.activeTypes`. | 0.2.0 |
| `celeborn.rpc.connect.threads` | 64 |  | 0.2.0 |
| `celeborn.rpc.dispatcher.threads` | 0 | Number of threads in the message dispatcher event loop. Defaults to 0, which means the number of available cores. | 0.3.0 |
| `celeborn.rpc.io.threads` | `<undefined>` | Number of Netty IO threads that NettyRpcEnv uses to handle RPC requests. The default is the number of processors available to the runtime. | 0.2.0 |
| `celeborn.rpc.lookupTimeout` | 30s | Timeout for RPC lookup operations. | 0.2.0 |
| `celeborn.shuffle.io.maxChunksBeingTransferred` | `<undefined>` | Maximum number of chunks allowed to be transferred at the same time on the shuffle service. New incoming connections are closed once this maximum is hit. The client retries according to the shuffle retry configs (see `celeborn.<module>.io.maxRetries` and `celeborn.<module>.io.retryWait`); if those limits are reached, the task fails with a fetch failure. | 0.2.0 |
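
The sizing note for `celeborn.<module>.io.receiveBuffer` can be worked through numerically. The sketch below computes latency * network_bandwidth in bytes and builds the two buffer keys for a chosen module. The function names are hypothetical, and the `<n>b` value format is an assumption inferred from the `0b` default shown in the table.

```python
def bdp_bytes(latency_s: float, bandwidth_bps: float) -> int:
    """Bandwidth-delay product in bytes: latency (s) * bandwidth (bit/s) / 8."""
    return round(latency_s * bandwidth_bps / 8)


def buffer_settings(module: str, latency_s: float, bandwidth_bps: float) -> dict:
    """Hypothetical helper: build receive/send buffer settings for one module.

    Assumption: sizes are written as '<bytes>b', mirroring the '0b' default.
    """
    size = f"{bdp_bytes(latency_s, bandwidth_bps)}b"
    return {
        f"celeborn.{module}.io.receiveBuffer": size,
        f"celeborn.{module}.io.sendBuffer": size,
    }


if __name__ == "__main__":
    # 1 ms latency on a 10 Gbps link, as in the table's example.
    for key, value in buffer_settings("data", 0.001, 10e9).items():
        print(key, value)
```

With 1 ms and 10 Gbps this gives 1,250,000 bytes, matching the ~1.25MB figure quoted in the description. The `data` module is used here only because it is one of the module values named in the table.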