### What changes were proposed in this pull request?
This PR makes the local network address binding support both IP and FQDN strategies.
Additionally, it refactors `ShuffleClientImpl#genAddressPair` to return a `Pair<String, String>` instead of a `${hostAndPort}-${hostAndPort}` string. The string form works with IP addresses but may break with FQDNs, because an FQDN may contain `-`.
### Why are the changes needed?
Currently, when the bind hostname is not set explicitly, Celeborn finds the first non-loopback address and always binds to its IP. This is not suitable for K8s, where a StatefulSet (STS) Pod has a stable FQDN but its IP changes whenever the Pod restarts.
`ShuffleClientImpl#genAddressPair` must be changed as well; otherwise, using FQDNs may cause:
```
java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11657 in stage 0.0 failed 4 times, most recent failure: Lost task 11657.3 in stage 0.0 (TID 12747) (10.153.253.198 executor 157): java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.celeborn.client.ShuffleClientImpl.doPushMergedData(ShuffleClientImpl.java:874)
at org.apache.celeborn.client.ShuffleClientImpl.pushOrMergeData(ShuffleClientImpl.java:735)
at org.apache.celeborn.client.ShuffleClientImpl.mergeData(ShuffleClientImpl.java:827)
at org.apache.spark.shuffle.celeborn.SortBasedPusher.pushData(SortBasedPusher.java:140)
at org.apache.spark.shuffle.celeborn.SortBasedPusher.insertRecord(SortBasedPusher.java:192)
at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.fastWrite0(SortBasedShuffleWriter.java:192)
at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.write(SortBasedShuffleWriter.java:145)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1508)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
```
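The failure mode behind the trace above can be sketched as follows. This is a minimal illustration, not the actual `ShuffleClientImpl` code: the class name, method names, and port `9092` are hypothetical, the `-`-based parsing is an assumption about how the joined key was consumed, and `Map.Entry` stands in for the `Pair<String, String>` used by the PR so that the example needs only the JDK.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

/** Hypothetical demo class; not part of Celeborn. */
final class AddressPairDemo {

  /** Legacy-style key: two "host:port" strings joined with '-'. */
  static String joinedKey(String primary, String replica) {
    return primary + "-" + replica;
  }

  /** Reads the port of the first address by splitting the joined key on '-'. */
  static int firstPortFromJoinedKey(String key) {
    String[] addresses = key.split("-");
    String[] hostAndPort = addresses[0].split(":");
    // With an FQDN such as "celeborn-worker-0...:9092", addresses[0] is only
    // "celeborn", so hostAndPort has a single element and index 1 is out of bounds.
    return Integer.parseInt(hostAndPort[1]);
  }

  /** Pair-style key: the two addresses stay in separate fields. */
  static Map.Entry<String, String> pairKey(String primary, String replica) {
    return new SimpleImmutableEntry<>(primary, replica);
  }

  /** Reads the port after the last ':'; hyphens in the host do not matter. */
  static int port(String hostAndPort) {
    return Integer.parseInt(hostAndPort.substring(hostAndPort.lastIndexOf(':') + 1));
  }

  public static void main(String[] args) {
    String primary = "celeborn-worker-0.celeborn-worker-svc.default.svc.cluster.local:9092";
    String replica = "celeborn-worker-1.celeborn-worker-svc.default.svc.cluster.local:9092";

    // The pair keeps both addresses intact, so the port can be read safely.
    System.out.println(port(pairKey(primary, replica).getKey()));

    try {
      firstPortFromJoinedKey(joinedKey(primary, replica));
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("joined key failed: " + e);
    }
  }
}
```

With IP-based addresses the joined key happens to parse correctly, which is consistent with the problem only surfacing once FQDNs are used.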
### Does this PR introduce _any_ user-facing change?
Yes, a new configuration `celeborn.network.bind.preferIpAddress` is introduced, and the default value is `true` to preserve the existing behavior.
### How was this patch tested?
Manually tested with `celeborn.network.bind.preferIpAddress=false`:
```
Server: 10.178.96.64
Address: 10.178.96.64#53
Name: celeborn-master-0.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.143.252
Server: 10.178.96.64
Address: 10.178.96.64#53
Name: celeborn-master-1.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.173.94
Server: 10.178.96.64
Address: 10.178.96.64#53
Name: celeborn-master-2.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.149.42
starting org.apache.celeborn.service.deploy.worker.Worker, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.worker.Worker-1-celeborn-worker-4.out
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.Dispatcher#51 - Dispatcher numThreads: 4
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.network.client.TransportClientFactory#91 - mode NIO threads 64
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.NettyRpcEnvFactory#51 - Starting RPC Server [WorkerSys] on celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 with advisor endpoint celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.util.Utils#51 - Successfully started service 'WorkerSys' on port 38303.
```
Closes #1622 from pan3793/CELEBORN-713.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
---
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at
      http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

# Deploy Celeborn on Kubernetes

Celeborn currently supports rapid deployment on Kubernetes using Helm.

## Before Deploy

1. You should have a running Kubernetes cluster.
2. You should understand the basics of Kubernetes deployment,
   e.g. [Kubernetes Resources](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/).
3. You should have
   enough [permissions to create resources](https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/).
4. You should have [Helm](https://helm.sh/docs/intro/install/) installed.

## Deploy

### 1. Get Celeborn Binary Package

You can find released versions of Celeborn at https://celeborn.apache.org/download/.

Alternatively, you can build a binary package from the master branch or your own branch by running
`./build/make-distribution.sh` in the source tree.

Either way, unpack the binary package and change into its directory.

### 2. Modify Celeborn Configurations

> Notice: The Celeborn chart template files are still experimental and unstable, and will be adjusted in
> subsequent optimizations.

The settings in `./charts/celeborn/values.yaml` that you should focus on are:

* image repository - Which repository to pull images from
* image tag - Which version of the image to use
* masterReplicas - Number of Celeborn master replicas
* workerReplicas - Number of Celeborn worker replicas
* volumes - How and where to mount volumes
  (for more information, see [Volumes](https://kubernetes.io/docs/concepts/storage/volumes))

### [Optional] Build Celeborn Docker Image

If you want to build your own Celeborn Docker image, run `docker build . -f docker/Dockerfile` in the Celeborn
binary package directory.

### 3. Helm Install Celeborn Charts

For more details, see [Helm Install](https://helm.sh/docs/helm/helm_install/).

```
cd ./charts/celeborn

helm install celeborn -n <namespace> .
```

### 4. Check Celeborn

After the steps above, you should be able to find the corresponding Celeborn Master/Worker Pods
with `kubectl get pods -n <namespace>`, e.g.:

```
NAME                READY   STATUS    RESTARTS   AGE
celeborn-master-0   1/1     Running   0          1m
...
celeborn-worker-0   1/1     Running   0          1m
...
```

Because the Celeborn Master/Worker Pods take some time to start, you may see output like the following:

```
** server can't find celeborn-master-0.celeborn-master-svc.default.svc.cluster.local: NXDOMAIN

waiting for master
Server: 172.17.0.10
Address: 172.17.0.10#53

...

Name: celeborn-master-0.celeborn-master-svc.default.svc.cluster.local
Address: 10.225.139.80

Server: 172.17.0.10
Address: 172.17.0.10#53

starting org.apache.celeborn.service.deploy.master.Master, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.master.Master-1-celeborn-master-0.out

...

23/03/23 14:10:56,081 INFO [main] RaftServer: 0: start RPC server
23/03/23 14:10:56,132 INFO [nioEventLoopGroup-2-1] LoggingHandler: [id: 0x83032bf1] REGISTERED
23/03/23 14:10:56,132 INFO [nioEventLoopGroup-2-1] LoggingHandler: [id: 0x83032bf1] BIND: 0.0.0.0/0.0.0.0:9872
23/03/23 14:10:56,134 INFO [nioEventLoopGroup-2-1] LoggingHandler: [id: 0x83032bf1, L:/0:0:0:0:0:0:0:0:9872] ACTIVE
23/03/23 14:10:56,135 INFO [JvmPauseMonitor0] JvmPauseMonitor: JvmPauseMonitor-0: Started
23/03/23 14:10:56,208 INFO [main] Master: Metrics system enabled.
23/03/23 14:10:56,216 INFO [main] HttpServer: master: HttpServer started on port 9098.
23/03/23 14:10:56,216 INFO [main] Master: Master started.
```

### 5. Access Celeborn Service

The Celeborn Master/Worker nodes deployed via the official Helm charts run as [StatefulSets](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/).
They can be accessed through the Pod IP or the [Stable Network ID (DNS name)](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#stable-network-id).
In the case above, the Master/Worker nodes can be accessed through:

```
celeborn-master-0.celeborn-master-svc.default.svc.cluster.local
...
celeborn-worker-0.celeborn-worker-svc.default.svc.cluster.local
...
```

After a restart, the IP of a StatefulSet Pod changes but its DNS name stays the same, which is important for rolling upgrades.

When the bind address is not set explicitly, the Celeborn Worker binds to the first non-loopback address it finds. By default,
it uses the IP address for both binding and registering, so the Master and Client use the IP address to access the
Worker. As explained above, this is problematic after a Worker restart, especially when Graceful Shutdown is enabled.

You may want to set `celeborn.network.bind.preferIpAddress=false` to address this issue. Note that, depending on your Kubernetes
network infrastructure, this may put pressure on the DNS service or cause other network issues compared with using IP addresses directly.

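The effect of `celeborn.network.bind.preferIpAddress` can be pictured with a small sketch. It is not Celeborn's implementation: the class and method names are hypothetical, the IP `10.0.0.7` is made up, and the address is constructed from fixed values so that the example runs without a DNS lookup. A real implementation might resolve a fully qualified name differently, for example via `InetAddress#getCanonicalHostName`.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical sketch; class and method names are not Celeborn APIs. */
final class BindHostDemo {

  /** Returns the literal IP when preferIpAddress is true, otherwise the host name. */
  static String chooseBindHost(InetAddress address, boolean preferIpAddress) {
    // getHostName() returns the name the address was created with, if any;
    // otherwise it falls back to a reverse lookup.
    return preferIpAddress ? address.getHostAddress() : address.getHostName();
  }

  /** Builds an address from a fixed name and IPv4 octets without any DNS lookup. */
  static InetAddress fixedAddress(String host, int a, int b, int c, int d) {
    try {
      return InetAddress.getByAddress(host, new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
    } catch (UnknownHostException e) {
      // Only thrown for an illegal address length, which cannot happen here.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    InetAddress worker = fixedAddress(
        "celeborn-worker-0.celeborn-worker-svc.default.svc.cluster.local", 10, 0, 0, 7);

    // Default behavior: register with the Pod IP, which changes after a restart.
    System.out.println(chooseBindHost(worker, true)); // 10.0.0.7

    // preferIpAddress=false: register with the stable DNS name instead.
    System.out.println(chooseBindHost(worker, false));
  }
}
```
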
### 6. Build Celeborn Client

This guide does not go into detail on how to configure Spark/Flink to find the Celeborn Master/Worker. The key
configuration is:

```
spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.<namespace>:9097,celeborn-master-1.celeborn-master-svc.<namespace>:9097,celeborn-master-2.celeborn-master-svc.<namespace>:9097
```

The reason for configuring the endpoints this way is explained
in [Kubernetes DNS for Service And Pods](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/).

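The endpoint list above follows a regular pattern, one entry per master replica. As a rough sketch, it could be generated from the namespace and replica count as below; the class and method names are hypothetical, and the Pod name, Service name, and port `9097` are taken from the configuration shown above.

```java
import java.util.StringJoiner;

/** Hypothetical helper; not part of Celeborn or its Helm charts. */
final class MasterEndpointsDemo {

  /** Joins "celeborn-master-<i>.celeborn-master-svc.<namespace>:<port>" entries with commas. */
  static String masterEndpoints(String namespace, int replicas, int port) {
    StringJoiner endpoints = new StringJoiner(",");
    for (int i = 0; i < replicas; i++) {
      // StatefulSet Pods are numbered from 0 and resolved through the Service name.
      endpoints.add("celeborn-master-" + i + ".celeborn-master-svc." + namespace + ":" + port);
    }
    return endpoints.toString();
  }

  public static void main(String[] args) {
    // Prints the value to use for spark.celeborn.master.endpoints in namespace "spark".
    System.out.println(masterEndpoints("spark", 3, 9097));
  }
}
```
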
> Notice: You should ensure that Spark/Flink can reach the Celeborn Master/Worker via IP or the Kubernetes DNS names
> mentioned above.