### What changes were proposed in this pull request?
This PR aims to make network local address binding support both IP and FQDN strategy.
Additional, it refactors the `ShuffleClientImpl#genAddressPair`, from `${hostAndPort}-${hostAndPort}` to `Pair<String, String>`, which works properly when using IP but may not on FQDN because FQDN may contain `-`
### Why are the changes needed?
Currently, when the bind hostname is not set explicitly, Celeborn will find the first non-loopback address and always uses the IP to bind, this is not suitable for K8s cases, as the STS has a stable FQDN but Pod IP will be changed once Pod restarting.
For `ShuffleClientImpl#genAddressPair`, it must be changed otherwise may cause
```
java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11657 in stage 0.0 failed 4 times, most recent failure: Lost task 11657.3 in stage 0.0 (TID 12747) (10.153.253.198 executor 157): java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.celeborn.client.ShuffleClientImpl.doPushMergedData(ShuffleClientImpl.java:874)
at org.apache.celeborn.client.ShuffleClientImpl.pushOrMergeData(ShuffleClientImpl.java:735)
at org.apache.celeborn.client.ShuffleClientImpl.mergeData(ShuffleClientImpl.java:827)
at org.apache.spark.shuffle.celeborn.SortBasedPusher.pushData(SortBasedPusher.java:140)
at org.apache.spark.shuffle.celeborn.SortBasedPusher.insertRecord(SortBasedPusher.java:192)
at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.fastWrite0(SortBasedShuffleWriter.java:192)
at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.write(SortBasedShuffleWriter.java:145)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1508)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
```
### Does this PR introduce _any_ user-facing change?
Yes, a new configuration `celeborn.network.bind.preferIpAddress` is introduced, and the default value is `true` to preserve the existing behavior.
### How was this patch tested?
Manually testing with `celeborn.network.bind.preferIpAddress=false`
```
Server: 10.178.96.64
Address: 10.178.96.64#53
Name: celeborn-master-0.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.143.252
Server: 10.178.96.64
Address: 10.178.96.64#53
Name: celeborn-master-1.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.173.94
Server: 10.178.96.64
Address: 10.178.96.64#53
Name: celeborn-master-2.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.149.42
starting org.apache.celeborn.service.deploy.worker.Worker, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.worker.Worker-1-celeborn-worker-4.out
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.Dispatcher#51 - Dispatcher numThreads: 4
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.network.client.TransportClientFactory#91 - mode NIO threads 64
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.NettyRpcEnvFactory#51 - Starting RPC Server [WorkerSys] on celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 with advisor endpoint celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.util.Utils#51 - Successfully started service 'WorkerSys' on port 38303.
```
Closes #1622 from pan3793/CELEBORN-713.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
82 lines
2.9 KiB
XML
82 lines
2.9 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!--
|
|
~ Licensed to the Apache Software Foundation (ASF) under one or more
|
|
~ contributor license agreements. See the NOTICE file distributed with
|
|
~ this work for additional information regarding copyright ownership.
|
|
~ The ASF licenses this file to You under the Apache License, Version 2.0
|
|
~ (the "License"); you may not use this file except in compliance with
|
|
~ the License. You may obtain a copy of the License at
|
|
~
|
|
~ http://www.apache.org/licenses/LICENSE-2.0
|
|
~
|
|
~ Unless required by applicable law or agreed to in writing, software
|
|
~ distributed under the License is distributed on an "AS IS" BASIS,
|
|
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
~ See the License for the specific language governing permissions and
|
|
~ limitations under the License.
|
|
-->
|
|
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
|
|
<modelVersion>4.0.0</modelVersion>
|
|
|
|
<parent>
|
|
<groupId>org.apache.celeborn</groupId>
|
|
<artifactId>celeborn-parent_${scala.binary.version}</artifactId>
|
|
<version>${project.version}</version>
|
|
<relativePath>../pom.xml</relativePath>
|
|
</parent>
|
|
|
|
<artifactId>celeborn-client_${scala.binary.version}</artifactId>
|
|
<packaging>jar</packaging>
|
|
<name>Celeborn Client</name>
|
|
|
|
<dependencies>
|
|
<dependency>
|
|
<groupId>org.apache.celeborn</groupId>
|
|
<artifactId>celeborn-common_${scala.binary.version}</artifactId>
|
|
<version>${project.version}</version>
|
|
</dependency>
|
|
<dependency>
|
|
<groupId>org.apache.celeborn</groupId>
|
|
<artifactId>celeborn-common_${scala.binary.version}</artifactId>
|
|
<version>${project.version}</version>
|
|
<type>test-jar</type>
|
|
<scope>test</scope>
|
|
</dependency>
|
|
<dependency>
|
|
<groupId>io.netty</groupId>
|
|
<artifactId>netty-all</artifactId>
|
|
</dependency>
|
|
<dependency>
|
|
<groupId>com.google.guava</groupId>
|
|
<artifactId>guava</artifactId>
|
|
</dependency>
|
|
<dependency>
|
|
<groupId>org.lz4</groupId>
|
|
<artifactId>lz4-java</artifactId>
|
|
</dependency>
|
|
<dependency>
|
|
<groupId>com.github.luben</groupId>
|
|
<artifactId>zstd-jni</artifactId>
|
|
</dependency>
|
|
<dependency>
|
|
<groupId>org.apache.commons</groupId>
|
|
<artifactId>commons-lang3</artifactId>
|
|
</dependency>
|
|
<dependency>
|
|
<groupId>org.mockito</groupId>
|
|
<artifactId>mockito-core</artifactId>
|
|
<scope>test</scope>
|
|
</dependency>
|
|
<dependency>
|
|
<groupId>org.apache.logging.log4j</groupId>
|
|
<artifactId>log4j-slf4j-impl</artifactId>
|
|
<scope>test</scope>
|
|
</dependency>
|
|
<dependency>
|
|
<groupId>org.apache.logging.log4j</groupId>
|
|
<artifactId>log4j-1.2-api</artifactId>
|
|
<scope>test</scope>
|
|
</dependency>
|
|
</dependencies>
|
|
</project>
|